1、Scanner,中正理工學院電算中心副教授許良全,Compiler Design,Overview of Scanning,The purpose of a scanner is to group input characters into tokens.A scanner is sometimes called a lexical analyzerA precise definition of tokens is necessary to ensure that lexical rules are properly enforced.Scanners normally seek to mak
2、e a token as long as possible. E.g. ABC is scanned as one identifier rather than threeAll scanners perform much the same functionusing scanner generator is to limit the effort in building a scanner from scratch,Compiler Design,Finite State Systems,The finite state automaton is a mathematical model o
3、f a system, with discrete input and outputs,Compiler Design,Examples of Finite State Systems,Elevatorsdo not remember all previous requests for service but only the current floor, the direction of motion, and the collection of not yet satisfied requests for serviceVending machinesinsert enough coins
4、 and youll get a Pepsi eventuallyComputersthe state of the CPU, main memory, and auxiliary storage at any time is one of a very large but finite number of statesHuman brains235 cells or neurons at most,Compiler Design,Definition of Finite Automata,A finite automaton (FA) is an idealized 5-tuple comp
5、uter that recognizes strings belonging to regular sets. (Q,q0,F)A finite set of states, QA finite input alphabet, , or vocabulary, V.A special start, or initial state, q0. q0Q.A set of final, or accepting states, F. FQ.A transition function, , that maps QF to Q.,Compiler Design,FA and Transition Dia
6、grams,Compiler Design,FA and Transition Tables,Compiler Design,Regular Expressions,The languages accepted by finite automata are easily described by simple expressions called regular expressions.Strings are built from characters in V via catenatione.g., !=, for, whileAn empty or null string, denoted
7、 by , is allowedThe characters, (, ), , *, +, and | are called meta-characters. They must be be quoted when used in order to avoid ambiguity. E.g.Delim = (|)|:=|;|,|+|-|*|/|=|$),Compiler Design,Definition of Regular Expression,A regular expression denotes a set of strings: is a regular expression de
8、noting the empty set (the set containing no strings). is a regular expression denoting the set that contains only the empty string.Note that this set contains one element.A string s is a regular expression denoting a set containing only s. If s contains meta-characters, s can be quoted to avoid ambi
9、guity.If A and B are regular expressions, then A|B, AB, and A* are also regular expressions, corresponding to alternation, catenation, and Kleene closure respectively.,Compiler Design,Properties of Regular Expressions,Let P and Q be a set of stringsThe string s (P|Q) iff s P or s QThe string s P* if
10、f s can be broken into zero or more pieces: s = s1s2s3sn such that each si P.P+ denotes all strings consisting one or more strings in P catenated togetherP* = (P+|) and P+ = PP* = P*PIf A is a set of characters, Not(A) denotes (V-A)all characters in V not included in A.If k is a constant, the set Ak
11、 represents all strings formed by catenating k strings from A, i.e., Ak = (AAA) (k copies),Compiler Design,Examples of Regular Expressions,Let D = (0|9), L = (A|Z)A comment that begins with - and ends with EolComment = -Not(Eol)*EolA fixed decimal literalLit = D+.D+An identifier, composed of letters
12、, digits, and underscores, that begins with a letter, ends with a letter or digit, and contains no consecutive underscoresID = L(L|D)*(_(L|D)+)*,Compiler Design,Using a Scanner Generator: Lex,Lex is a lexical analyzer generator developed by Lesk and Schmidt of AT&T Bell Lab, written in C, running un
13、der UNIX.Lex produces an entire scanner module that can be compiled and linked with other compiler modules.Lex associates regular expressions with arbitrary code fragments. When an expression is matched, the code segment is executed.A typical lex program contains three sections separated by % delimi
14、ters.,Compiler Design,First Section of Lex,The first section define character classes and auxiliary regular expression. (Fig. 3.5 on p. 67) delimits character classes- denotes ranges: xyz = = x-z denotes the escape character: as in C. complements a character class, (Not): xy denotes all characters e
15、xcept x and y.|, *, and + (alternation, Kleene closure, and positive closure) are provided.() can be used to control grouping of subexpressions.(expr)? = = (expr)|, i.e. matches Expr zero times or once. signals the macroexpansion of a symbol defined in the first section.,Compiler Design,First Sectio
16、n of Lex, cont.,Catenation is specified by the juxtaposition of two expressions; no explicit operator is used.abcd will match any of ad, ac, bc, and bd.begin = = “begin” = = begin,Compiler Design,Second Section of Lex,The second section of lex defines a table of regular expressions and corresponding
17、 commands.When an expression is matched, its associated command is executed.Auxiliary functions may be defined in the third section.Input that is matched is stored in the string variable yytext whose length is yyleng.Lex creates an integer function yylex() that may be called from the parser. The val
18、ue returned is usually the token code of the token scanned by Lex.When yylex() encounters end of file, it calls a use-supplied integer function named yywrap() to wrap up input processing.,Compiler Design,Dealing with Multiple Input Files,yylex() uses three user-defined functions to handle character
19、I/O:input(): retrieve a single character, 0 on EOFoutput(c): write a single character to the outputunput(c): put a single character back on the input to be re-read,Compiler Design,Translating Regular Expressions into Finite Automata,Remember the relationship between RE and FA.The main job of a scann
20、er generator program is to transform a regular expression definition into an equivalent (D)FA.A regular expression is first translated into a nondeterministic finite automaton (NFA), then translated from NFA into DFA. (2 steps)An NFA, when reading a particular input is not required to make a unique
21、(deterministic) choice of which state to visit.,Compiler Design,Translating RE into NFA,Any regular expression can be transformed into an NFA with the following properties:There is a unique final stateThe final state has no successorsEvery other state has either one or two successorsRegular expressi
22、ons are built out of the atomic regular expressions a (where a is a character in V) and by using the three operations AB, A|B, and A*.,Compiler Design,NFA for a and l,Compiler Design,An NFA for A|B,Compiler Design,An NFA for A B,Compiler Design,An NFA for A*,Compiler Design,Translating NFA into DFA,
23、Each state of DFA (M) corresponds to a set of states of NFA (N)transforming N to M is done by subset constructionM will be in state x,y,z after reading a given input string if and only if N could be in any of the states x, y, or z, depending on the transitions it chooses.M keeps track of all the possible routes N might take and runs them in parallel.,