http://www.cs.princeton.edu/introcs/73fsa/
7.3. FINITE STATE AUTOMATA
This section under major construction.
Abstract machines. Modern computers are capable of performing a wide variety of computations. An
abstract machine reads in an input string, and, depending on the input, outputs true (
accept), outputs false (
reject), or gets stuck in an infinite loop and outputs nothing. We say that a machine
recognizes a particular language, if it outputs true for any input string in the language and false otherwise. The artificial restriction to such
decision problems is purely for notational convenience. Virtually all computational problems can be recast as language recognition problems. For example, to determine whether an integer 97 is prime, we can ask whether 91 is in the language consisting of all primes {2, 3, 5, 7, 13, ... } or to determine the decimal expansion of the mathematical constant π we can ask whether 7 is the 100th digit of π and so on.
We would like to be able to formally compare different classes of abstract machines in order to address questions like
Is a Mac more powerful than a PC? Can Java do more things than C++? To accomplish this, we define a notion of
power. We say that machine A is at least as
powerful as machine B if machine A can be "programmed'" to recognize all of the languages that B can. Machine A is more powerful than B, if in addition, it can be programmed to recognize at least one additional language. Two machines are
equivalent if they can be programmed to recognize precisely the same set of languages. Using this definition of power, we will classify several fundamental machines. Naturally, we are interested in designing the most powerful computer, i.e., the one that can solve the widest range of language recognition problems. Note that our notion of power does not say anything about how fast a computation can be done. (See
Chapter 10.) Instead, it reflects a more fundamental notion of whether or not it is even possible to perform some computation in a finite number of steps.
Finite state automata. A
deterministic finite state automaton (DFA) is, perhaps, the simplest type of machine that is still interesting to study. Many of its important properties carry over to more complicated machines. So, before we hope to understand these more complicated machines, we first study DFAs. However, it is an enormously useful practical abstraction because DFAs still retain sufficient flexibility to perform interesting tasks, yet the hardware requirements for building them are relatively minimal. DFAs are widely used in text editors for pattern matching, in compilers for lexical analysis, in web browsers for html parsing, and in operating systems for graphical user interfaces. They also serve as the control unit in many physical systems including: vending machines, elevators, automatic traffic signals, and computer microprocessors. They are play a key role in natural language processing and machine learning.
A DFA captures the basic elements of an abstract machine: it reads in a string, and depending on the input and the way the machine was designed, it outputs true or false. A DFA is always is one of N
states, which we name label 0 through N-1. Each state is labeled true or false. The DFA begins in a distinguished state called the
start state. As the input characters are read in one at a time, the DFA changes from one state to another in a prespecified way. The new state is completely determined by the current state and the character just read in. When the input is exhausted, the DFA outputs true or false according to the label of the state it is currently in.
Below is an example of a DFA that accepts binary strings that are multiples of 3. For example, the machine rejects 1101 since 1101 in binary is 13 in decimal, which is not divisible by 3. On the other hand, the machine accepts 1100 since it is 12 in decimal.
[התמונה הבאה מגיעה מקישור שלא מתחיל ב https ולכן לא הוטמעה בדף כדי לשמור על https תקין:
http://www.cs.princeton.edu/introcs/73fsa/fsa-div3.png]
See
traffic light DFA from COS 111.
String searching DFAs. One of the most important applications of DFAs is searching for patterns in strings. This is at the heart of Web search engines like Google. The following DFA gives a simplified example of such a tool over the binary alphabet. It accepts all string inputs that contain the pattern aabaaabb as a substring.
[התמונה הבאה מגיעה מקישור שלא מתחיל ב https ולכן לא הוטמעה בדף כדי לשמור על https תקין:
http://www.cs.princeton.edu/introcs/73fsa/fsa-aabaaabb.png]
The DFA works by advancing one state to the right each time it reads in a matching character. For example, after reading in aabaa, the DFA is in state 5. If the next symbol read is a, it advances to state 6. If the next symbol read is b, it does not go all the way back to state 0 because the search pattern might overlap some of the input characters read in so far. Instead it goes to state 3 because the last 3 characters read in (aab) are the first 3 characters of the search pattern.
The famous Knuth-Pratt-Moore algorithm efficiently solves the problem of searching for a pattern in a string using DFAs. It first preprocesses the search pattern and creates a DFA like the one above. Then, to determine whether the pattern appears in a search string, it uses the search string as input to the DFA. It is also possible to support the more general types of search queries known as
regular expressions using DFA technology. In fact, there is a striking duality between DFA's and regular expressions: regular expressions are precisely the types of patterns that DFA's can match.
DFA example. Build machine that recognizes valid email addresses of the form:
string0@string1.string2.string3.edu. Assume each string is composed of only lowercase letters. Use nondeterminism to implement version where each address ends with edu, com, or gov. This
Perl regular expression validates emails according to the grammar defined in the RFC822 standard. It's surprisingly complicated.
String processing. The program
CommentStripper.java reads in a Java (or C++) program from standard input, removes all comments, and prints the result to standard output. This would be useful as part of a Java compiler. It removes /* */ and // style comments using a 5 state finite state automaton. It is meant to illustrate the power of DFAs, but to properly strip Java comments, you would need a few more states to handle extra cases, e.g., quoted string literals like s = "/***//*". The picture below is courtesy of
David Eppstein.
[התמונה הבאה מגיעה מקישור שלא מתחיל ב https ולכן לא הוטמעה בדף כדי לשמור על https תקין:
http://www.cs.princeton.edu/introcs/73fsa/comment.gif]