Mikołaj Bojańczyk

This lecture is about learning regular languages (of finite words). All automata here are deterministic finite automata. The setup is that there are two parties: Learner and Teacher. Teacher knows a regular language. Learner wants to learn this language, and pursues this goal by asking two types of queries to the Teacher:

MembershipIn a membership query, Learner gives a word, and the Teacher says whether or not Teacher’s language contains that word.

• Equivalence. In an equivalence query, Learner gives regular language, represented by an automaton, and Teacher replies whether or not the Teacher’s and Learner’s languages are equal. If yes, the protocol is finished. If no, Teacher gives a counterexample, i.e. a word where the Teacher’s and Learner’s languages disagree.

Membership queries on their own can never be enough to ascertain the language – there are infinitely many regular languages that match any finite set of membership queries. On the other hand, given enough time, equivalence queries alone are sufficient:  Learner can enumerate all regular languages, and ask equivalence queries until the correct language is reached, without ever using membership queries. The lecture is about a more practical solution, which was found by Dana Angluin. Angluin’s algorithm is a protocol where Learner learns Teacher’s language in a number of queries that is polynomial in:

• the minimal automaton of Teacher’s language;
• the size of Teacher’s counterexamples.

If Teacher provides counterexamples of minimal size, then the second parameter above is superfluous, i.e. the number of queries will be polynomial in the minimal automaton of Teacher’s language. As mentioned above, we only talk about deterministic automata, and therefore the minimal automaton refers to the minimal deterministic automaton.

State words and test words.

Suppose that Teacher’s language is . We assume that the alphabet is known to both parties, but the language is only known to Teacher. At each step of the algorithm, Learner will store an approximation of the minimal automaton of , as described by two sets of words:

• a set of state words, closed under prefixes;
• a set of test words, closed under suffixes.

The idea is that the state words are all distinct with respect to Myhill-Nerode equivalence for Teacher’s language, and the test words are sufficient to prove this. This idea is formalised in the following definitions.

Correctness and completeness. If is a set of test words, we say that words  are -equivalent if

This is an equivalence relation, which is coarser or equal to the Myhill-Nerode equivalence relation of Teacher’s language. In terms of -equivalence we define the following properties of sets that will be used in the algorithm:

correctness: no two distinct words in are -equivalent;
completeness: for every and , there is some that is -equivalent to .

If is correct and complete, then we can define an automaton as follows. The states are , the initial state being the empty word. When the automaton is in state and reads a letter , it goes to the state described in the completeness property; this state is unique by the correctness property. The accepting states are those states that are in Teacher’s language.

Fact 1. If is correct but not complete, then using a polynomial number of membership queries, Learner can find some such that is correct and complete.

Proof. If and are such that no word in is -equivalent to , then can be added to . The membership queries are used to test what is -equivalent to .

The algorithm

We are now ready to describe the algorithm.

1.

2. Invariant: is correct, not necessarily complete.

3. Apply Fact 1, and enlarge , making  correct and complete.

4. Compute the automaton for and ask an equivalence query for it.

5. If the answer is yes, then the algorithm terminates with success.

6. If the answer is no, then add the counterexample and its suffixes to .

7. Goto 2.

Note that if is correct, then all words in correspond to different states in the minimal automaton (for Teacher’s language). Furthermore, if the size of reaches the size of the minimal automaton, then represents all states of the minimal automaton, and the transition function in the automaton for is the same as the transition function in the minimal automaton. Therefore, if  reaches the size of the minimal automaton, the equivalence query in step 4 has a positive result.

To prove that the algorithm terminates, we show in Fact 2 below that after step 6, is no longer complete. This will mean that step 3 will necessarily enlarge , and therefore the number of times we do “Goto 2” will be bounded by the size of the minimal automaton.

Fact 2.  After step 6, is no longer complete.

Proof. Let be the pair in step 4, and let be the counterexample, which witnesses that the automaton for does not recognise Teacher’s language. Define to be plus all suffixes of the counterexample, and suppose toward a contradiction that is complete. It is not difficult to see that the automata for and are the same. Define to be the state of either of these automata after reading .  By construction, the state is a word which is -equivalent to , and since , it follows that

Since is the empty word, the above and induction imply that

which means that the automaton gives the correct answer to the counterexample, a  contradiction. (Thank you Marcin Smulewicz for helping me with this proof!)