util/sconfig/yapps2.tex

   1 \documentclass[10pt]{article}
   2 \usepackage{palatino}
   3 \usepackage{html}
   4 \usepackage{color}
   5
   6 \setlength{\headsep}{0in}
   7 \setlength{\headheight}{0in}
   8 \setlength{\textheight}{8.5in}
   9 \setlength{\textwidth}{5.9in}
  10 \setlength{\oddsidemargin}{0.25in}
  11
  12 \definecolor{darkblue}{rgb}{0,0,0.6}
  13 \definecolor{darkerblue}{rgb}{0,0,0.3}
  14
  15 %% \newcommand{\mysection}[1]{\section{\textcolor{darkblue}{#1}}}
  16 %% \newcommand{\mysubsection}[1]{\subsection{\textcolor{darkerblue}{#1}}}
  17 \newcommand{\mysection}[1]{\section{#1}}
  18 \newcommand{\mysubsection}[1]{\subsection{#1}}
  19
  20 \bodytext{bgcolor=white text=black link=#004080 vlink=#006020}
  21
  22 \newcommand{\first}{\textsc{first}}
  23 \newcommand{\follow}{\textsc{follow}}
  24
  25 \begin{document}
  26
  27 \begin{center}
  28 \hfill \begin{tabular}{c}
  29 {\Large The \emph{Yapps} Parser Generator System}\\
  30 \verb|http://theory.stanford.edu/~amitp/Yapps/|\\
  31                 Version 2\\
  32 \\
  33 Amit J. Patel\\
  34 \htmladdnormallink{http://www-cs-students.stanford.edu/~amitp/}
  35 {http://www-cs-students.stanford.edu/~amitp/}
  36
  37 \end{tabular} \hfill \rule{0in}{0in}
  38 \end{center}
  39
  40 \mysection{Introduction}
  41
  42 \emph{Yapps} (\underline{Y}et \underline{A}nother \underline{P}ython
  43 \underline{P}arser \underline{S}ystem) is an easy to use parser
  44 generator that is written in Python and generates Python code.  There
  45 are several parser generator systems already available for Python,
  46 including \texttt{PyLR, kjParsing, PyBison,} and \texttt{mcf.pars,}
  47 but I had different goals for my parser.  Yapps is simple, is easy to
  48 use, and produces human-readable parsers.  It is not the fastest or
  49 most powerful parser.  Yapps is designed to be used when regular
  50 expressions are not enough and other parser systems are too much:
  51 situations where you may write your own recursive descent parser.
  52
  53 Some unusual features of Yapps that may be of interest are:
  54
  55 \begin{enumerate}
  56
  57  \item Yapps produces recursive descent parsers that are readable by
  58    humans, as opposed to table-driven parsers that are difficult to
  59    read.  A Yapps parser for a simple calculator looks similar to the
  60    one that Mark Lutz wrote by hand for \emph{Programming Python.}
  61
  62  \item Yapps also allows for rules that accept parameters and pass
  63    arguments to be used while parsing subexpressions.  Grammars that
  64    allow for arguments to be passed to subrules and for values to be
  65    passed back are often called \emph{attribute grammars.}  In many
  66    cases parameterized rules can be used to perform actions at ``parse
  67    time'' that are usually delayed until later.  For example,
  68    information about variable declarations can be passed into the
  69    rules that parse a procedure body, so that undefined variables can
  70    be detected at parse time.  The types of defined variables can be
  71    used in parsing as well---for example, if the type of {\tt X} is
  72    known, we can determine whether {\tt X(1)} is an array reference or
  73    a function call.
  74
  75  \item Yapps grammars are fairly easy to write, although there are
  76    some inconveniences having to do with ELL(1) parsing that have to be
  77    worked around.  For example, rules have to be left factored and
  78    rules may not be left recursive.  However, neither limitation seems
  79    to be a problem in practice.
  80
  81    Yapps grammars look similar to the notation used in the Python
  82    reference manual, with operators like \verb:*:, \verb:+:, \verb:|:,
  83    \verb:[]:, and \verb:(): for patterns, names ({\tt tim}) for rules,
  84    regular expressions (\verb:"[a-z]+":) for tokens, and \verb:#: for
  85    comments.
  86
  87  \item The Yapps parser generator is written as a single Python module
  88    with no C extensions.  Yapps produces parsers that are written
  89    entirely in Python, and require only the Yapps run-time module (5k)
  90    for support.
  91
  92  \item Yapps's scanner is context-sensitive, picking tokens based on
  93    the types of the tokens accepted by the parser.  This can be
  94    helpful when implementing certain kinds of parsers, such as for a
  95    preprocessor.
  96
  97 \end{enumerate}
  98
  99 There are several disadvantages of using Yapps over another parser system:
 100
 101 \begin{enumerate}
 102
 103  \item Yapps parsers are \texttt{ELL(1)} (Extended LL(1)), which is
 104    less powerful than \texttt{LALR} (used by \texttt{PyLR}) or
 105    \texttt{SLR} (used by \texttt{kjParsing}), so Yapps would not be a
 106    good choice for parsing complex languages.  For example, allowing
 107    both \texttt{x := 5;} and \texttt{x;} as statements is difficult
 108    because we must distinguish based on only one token of lookahead.
 109    Seeing only \texttt{x}, we cannot decide whether we have an
 110    assignment statement or an expression statement.  (Note however
 111    that this kind of grammar can be matched with backtracking; see
 112    section \ref{sec:future}.)
 113
 114  \item The scanner that Yapps provides can only read from strings, not
 115    files, so an entire file has to be read in before scanning can
 116    begin.  It is possible to build a custom scanner, though, so in
 117    cases where stream input is needed (from the console, a network, or
 118    a large file are examples), the Yapps parser can be given a custom
 119    scanner that reads from a stream instead of a string.
 120
 121  \item Yapps is not designed with efficiency in mind.
 122
 123 \end{enumerate}
 124
 125 Yapps provides an easy to use parser generator that produces parsers
 126 similar to what you might write by hand.  It is not meant to be a
 127 solution for all parsing problems, but instead an aid for those times
 128 you would write a parser by hand rather than using one of the more
 129 powerful parsing packages available.
 130
 131 Yapps 2.0 is easier to use than Yapps 1.0.  New features include a
 132 less restrictive input syntax, which allows mixing of sequences,
 133 choices, terminals, and nonterminals; optional matching; the ability
 134 to insert single-line statements into the generated parser; and
 135 looping constructs \verb|*| and \verb|+| similar to the repetitive
 136 matching constructs in regular expressions.  Unfortunately, the
 137 addition of these constructs has made Yapps 2.0 incompatible with
 138 Yapps 1.0, so grammars will have to be rewritten.  See section
 139 \ref{sec:Upgrading} for tips on changing Yapps 1.0 grammars for use
 140 with Yapps 2.0.
 141
 142 \mysection{Examples}
 143
 144 In this section are several examples that show the use of Yapps.
 145 First, an introduction shows how to construct grammars and write them
 146 in Yapps form.  This example can be skipped by someone familiar with
 147 grammars and parsing.  Next is a Lisp expression grammar that produces
 148 a parse tree as output.  This example demonstrates the use of tokens
 149 and rules, as well as returning values from rules.  The third example
 150 is a expression evaluation grammar that evaluates during parsing
 151 (instead of producing a parse tree).
 152
 153 \mysubsection{Introduction to Grammars}
 154
 155 A \emph{grammar} for a natural language specifies how words can be put
 156 together to form large structures, such as phrases and sentences.  A
 157 grammar for a computer language is similar in that it specifies how
 158 small components (called \emph{tokens}) can be put together to form
 159 larger structures.  In this section we will write a grammar for a tiny
 160 subset of English.
 161
 162 Simple English sentences can be described as being a noun phrase
 163 followed by a verb followed by a noun phrase.  For example, in the
 164 sentence, ``Jack sank the blue ship,'' the word ``Jack'' is the first
 165 noun phrase, ``sank'' is the verb, and ``the blue ship'' is the second
 166 noun phrase.  In addition we should say what a noun phrase is; for
 167 this example we shall say that a noun phrase is an optional article
 168 (a, an, the) followed by any number of adjectives followed by a noun.
 169 The tokens in our language are the articles, nouns, verbs, and
 170 adjectives.  The \emph{rules} in our language will tell us how to
 171 combine the tokens together to form lists of adjectives, noun phrases,
 172 and sentences:
 173
 174 \begin{itemize}
 175   \item \texttt{sentence: noun\_phrase verb noun\_phrase}
 176   \item \texttt{noun\_phrase: [article] adjective* noun}
 177 \end{itemize}
 178
 179 Notice that some things that we said easily in English, such as
 180 ``optional article,'' are expressed using special syntax, such as
 181 brackets.  When we said, ``any number of adjectives,'' we wrote
 182 \texttt{adjective*}, where the \texttt{*} means ``zero or more of the
 183 preceding pattern''.
 184
 185 The grammar given above is close to a Yapps grammar.  We also have to
 186 specify what the tokens are, and what to do when a pattern is matched.
 187 For this example, we will do nothing when patterns are matched; the
 188 next example will explain how to perform match actions.
 189
 190 \begin{verbatim}
 191 parser TinyEnglish:
 192   ignore:          "\\W+"
 193   token noun:      "(Jack|spam|ship)"
 194   token verb:      "(sank|threw)"
 195   token article:   "(a|an|the)"
 196   token adjective: "(blue|red|green)"
 197
 198   rule sentence:       noun_phrase verb noun_phrase
 199   rule noun_phrase:    [article] adjective* noun
 200 \end{verbatim}
 201
 202 The tokens are specified as Python \emph{regular expressions}.  Since
 203 Yapps produces Python code, you can write any regular expression that
 204 would be accepted by Python.  (\emph{Note:} These are Python 1.5
 205 regular expressions from the \texttt{re} module, not Python 1.4
 206 regular expressions from the \texttt{regex} module.)  In addition to
 207 tokens that you want to see (which are given names), you can also
 208 specify tokens to ignore, marked by the \texttt{ignore} keyword.  In
 209 this parser we want to ignore whitespace.
 210
 211 The TinyEnglish grammar shows how you define tokens and rules, but it
 212 does not specify what should happen once we've matched the rules.  In
 213 the next example, we will take a grammar and produce a \emph{parse
 214 tree} from it.
 215
 216 \mysubsection{Lisp Expressions}
 217
 218 Lisp syntax, although hated by many, has a redeeming quality: it is
 219 simple to parse.  In this section we will construct a Yapps grammar to
 220 parse Lisp expressions and produce a parse tree as output.
 221
 222 \subsubsection*{Defining the Grammar}
 223
 224 The syntax of Lisp is simple.  It has expressions, which are
 225 identifiers, strings, numbers, and lists.  A list is a left
 226 parenthesis followed by some number of expressions (separated by
 227 spaces) followed by a right parenthesis.  For example, \verb|5|,
 228 \verb|"ni"|, and \verb|(print "1+2 = " (+ 1 2))| are Lisp expressions.
 229 Written as a grammar,
 230
 231 \begin{verbatim}
 232     expr:   ID | STR | NUM | list
 233     list:   ( expr* )
 234 \end{verbatim}
 235
 236 In addition to having a grammar, we need to specify what to do every
 237 time something is matched.  For the tokens, which are strings, we just
 238 want to get the ``value'' of the token, attach its type (identifier,
 239 string, or number) in some way, and return it.  For the lists, we want
 240 to construct and return a Python list.
 241
 242 Once some pattern is matched, we enclose a return statement enclosed
 243 in \verb|{{...}}|.  The braces allow us to insert any one-line
 244 statement into the parser.  Within this statement, we can refer to the
 245 values returned by matching each part of the rule.  After matching a
 246 token such as \texttt{ID}, ``ID'' will be bound to the text of the
 247 matched token.  Let's take a look at the rule:
 248
 249 \begin{verbatim}
 250     rule expr: ID   {{ return ('id', ID) }}
 251       ...
 252 \end{verbatim}
 253
 254 In a rule, tokens return the text that was matched.  For identifiers,
 255 we just return the identifier, along with a ``tag'' telling us that
 256 this is an identifier and not a string or some other value.  Sometimes
 257 we may need to convert this text to a different form.  For example, if
 258 a string is matched, we want to remove quotes and handle special forms
 259 like \verb|\n|.  If a number is matched, we want to convert it into a
 260 number.  Let's look at the return values for the other tokens:
 261
 262 \begin{verbatim}
 263       ...
 264              | STR  {{ return ('str', eval(STR)) }}
 265              | NUM  {{ return ('num', atoi(NUM)) }}
 266       ...
 267 \end{verbatim}
 268
 269 If we get a string, we want to remove the quotes and process any
 270 special backslash codes, so we run \texttt{eval} on the quoted string.
 271 If we get a number, we convert it to an integer with \texttt{atoi} and
 272 then return the number along with its type tag.
 273
 274 For matching a list, we need to do something slightly more
 275 complicated.  If we match a Lisp list of expressions, we want to
 276 create a Python list with those values.
 277
 278 \begin{verbatim}
 279     rule list: "\\("                 # Match the opening parenthesis
 280                {{ result = [] }}     # Create a Python list
 281                (
 282                   expr               # When we match an expression,
 283                   {{ result.append(expr) }}   # add it to the list
 284                )*                    # * means repeat this if needed
 285                "\\)"                 # Match the closing parenthesis
 286                {{ return result }}   # Return the Python list
 287 \end{verbatim}
 288
 289 In this rule we first match the opening parenthesis, then go into a
 290 loop.  In this loop we match expressions and add them to the list.
 291 When there are no more expressions to match, we match the closing
 292 parenthesis and return the resulting.  Note that \verb:#: is used for
 293 comments, just as in Python.
 294
 295 The complete grammar is specified as follows:
 296 \begin{verbatim}
 297 parser Lisp:
 298     ignore:      '\\s+'
 299     token NUM:   '[0-9]+'
 300     token ID:    '[-+*/!@%^&=.a-zA-Z0-9_]+'
 301     token STR:   '"([^\\"]+|\\\\.)*"'
 302
 303     rule expr:   ID     {{ return ('id', ID) }}
 304                | STR    {{ return ('str', eval(STR)) }}
 305                | NUM    {{ return ('num', atoi(NUM)) }}
 306                | list   {{ return list }}
 307     rule list: "\\("    {{ result = [] }}
 308                ( expr   {{ result.append(expr) }}
 309                )*
 310                "\\)"    {{ return result }}
 311 \end{verbatim}
 312
 313 One thing you may have noticed is that \verb|"\\("| and \verb|"\\)"|
 314 appear in the \texttt{list} rule.  These are \emph{inline tokens}:
 315 they appear in the rules without being given a name with the
 316 \texttt{token} keyword.  Inline tokens are more convenient to use, but
 317 since they do not have a name, the text that is matched cannot be used
 318 in the return value.  They are best used for short simple patterns
 319 (usually punctuation or keywords).
 320
 321 Another thing to notice is that the number and identifier tokens
 322 overlap.  For example, ``487'' matches both NUM and ID.  In Yapps, the
 323 scanner only tries to match tokens that are acceptable to the parser.
 324 This rule doesn't help here, since both NUM and ID can appear in the
 325 same place in the grammar.  There are two rules used to pick tokens if
 326 more than one matches.  One is that the \emph{longest} match is
 327 preferred.  For example, ``487x'' will match as an ID (487x) rather
 328 than as a NUM (487) followed by an ID (x).  The second rule is that if
 329 the two matches are the same length, the \emph{first} one listed in
 330 the grammar is preferred.  For example, ``487'' will match as an NUM
 331 rather than an ID because NUM is listed first in the grammar.  Inline
 332 tokens have preference over any tokens you have listed.
 333
 334 Now that our grammar is defined, we can run Yapps to produce a parser,
 335 and then run the parser to produce a parse tree.
 336
 337 \subsubsection*{Running Yapps}
 338
 339 In the Yapps module is a function \texttt{generate} that takes an
 340 input filename and writes a parser to another file.  We can use this
 341 function to generate the Lisp parser, which is assumed to be in
 342 \texttt{lisp.g}.
 343
 344 \begin{verbatim}
 345 % python
 346 Python 1.5.1 (#1, Sep  3 1998, 22:51:17)  [GCC 2.7.2.3] on linux-i386
 347 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
 348 >>> import yapps
 349 >>> yapps.generate('lisp.g')
 350 \end{verbatim}
 351
 352 At this point, Yapps has written a file \texttt{lisp.py} that contains
 353 the parser.  In that file are two classes (one scanner and one parser)
 354 and a function (called \texttt{parse}) that puts things together for
 355 you.
 356
 357 Alternatively, we can run Yapps from the command line to generate the
 358 parser file:
 359
 360 \begin{verbatim}
 361 % python yapps.py lisp.g
 362 \end{verbatim}
 363
 364 After running Yapps either from within Python or from the command
 365 line, we can use the Lisp parser by calling the \texttt{parse}
 366 function.  The first parameter should be the rule we want to match,
 367 and the second parameter should be the string to parse.
 368
 369 \begin{verbatim}
 370 >>> import lisp
 371 >>> lisp.parse('expr', '(+ 3 4)')
 372 [('id', '+'), ('num', 3), ('num', 4)]
 373 >>> lisp.parse('expr', '(print "3 = " (+ 1 2))')
 374 [('id', 'print'), ('str', '3 = '), [('id', '+'), ('num', 1), ('num', 2)]]
 375 \end{verbatim}
 376
 377 The \texttt{parse} function is not the only way to use the parser;
 378 section \ref{sec:Parser-Objects} describes how to access parser objects
 379 directly.
 380
 381 We've now gone through the steps in creating a grammar, writing a
 382 grammar file for Yapps, producing a parser, and using the parser.  In
 383 the next example we'll see how rules can take parameters and also how
 384 to do computations instead of just returning a parse tree.
 385
 386 \mysubsection{Calculator}
 387
 388 A common example parser given in many textbooks is that for simple
 389 expressions, with numbers, addition, subtraction, multiplication,
 390 division, and parenthesization of subexpressions.  We'll write this
 391 example in Yapps, evaluating the expression as we parse.
 392
 393 Unlike \texttt{yacc}, Yapps does not have any way to specify
 394 precedence rules, so we have to do it ourselves.  We say that an
 395 expression is the sum of terms, and that a term is the product of
 396 factors, and that a factor is a number or a parenthesized expression:
 397
 398 \begin{verbatim}
 399     expr:           factor ( ("+"|"-") factor )*
 400     factor:         term   ( ("*"|"/") term )*
 401     term:           NUM | "(" expr ")"
 402 \end{verbatim}
 403
 404 In order to evaluate the expression as we go, we should keep along an
 405 accumulator while evaluating the lists of terms or factors.  Just as
 406 we kept a ``result'' variable to build a parse tree for Lisp
 407 expressions, we will use a variable to evaluate numerical
 408 expressions.  The full grammar is given below:
 409
 410 \begin{verbatim}
 411 parser Calculator:
 412     token END: "$"         # $ means end of string
 413     token NUM: "[0-9]+"
 414
 415     rule goal:           expr END         {{ return expr }}
 416
 417     # An expression is the sum and difference of factors
 418     rule expr:           factor           {{ v = factor }}
 419                        ( "[+]" factor       {{ v = v+factor }}
 420                        |  "-"  factor       {{ v = v-factor }}
 421                        )*                 {{ return v }}
 422
 423     # A factor is the product and division of terms
 424     rule factor:         term             {{ v = term }}
 425                        ( "[*]" term         {{ v = v*term }}
 426                        |  "/"  term         {{ v = v/term }}
 427                        )*                 {{ return v }}
 428
 429     # A term is either a number or an expression surrounded by parentheses
 430     rule term:           NUM              {{ return atoi(NUM) }}
 431                        | "\\(" expr "\\)" {{ return expr }}
 432 \end{verbatim}
 433
 434 The top-level rule is \emph{goal}, which says that we are looking for
 435 an expression followed by the end of the string.  The \texttt{END}
 436 token is needed because without it, it isn't clear when to stop
 437 parsing.  For example, the string ``1+3'' could be parsed either as
 438 the expression ``1'' followed by the string ``+3'' or it could be
 439 parsed as the expression ``1+3''.  By requiring expressions to end
 440 with \texttt{END}, the parser is forced to take ``1+3''.
 441
 442 In the two rules with repetition, the accumulator is named \texttt{v}.
 443 After reading in one expression, we initialize the accumulator.  Each
 444 time through the loop, we modify the accumulator by adding,
 445 subtracting, multiplying by, or dividing the previous accumulator by
 446 the expression that has been parsed.  At the end of the rule, we
 447 return the accumulator.
 448
 449 The calculator example shows how to process lists of elements using
 450 loops, as well as how to handle precedence of operators.
 451
 452 \emph{Note:} It's often important to put the \texttt{END} token in, so
 453 put it in unless you are sure that your grammar has some other
 454 non-ambiguous token marking the end of the program.
 455
 456 \mysubsection{Calculator with Memory}
 457
 458 In the previous example we learned how to write a calculator that
 459 evaluates simple numerical expressions.  In this section we will
 460 extend the example to support both local and global variables.
 461
 462 To support global variables, we will add assignment statements to the
 463 ``goal'' rule.
 464
 465 \begin{verbatim}
 466     rule goal:           expr END         {{ return expr }}
 467               | 'set' ID expr END         {{ global_vars[ID] = expr }}
 468                                           {{ return expr }}
 469 \end{verbatim}
 470
 471 To use these variables, we need a new kind of terminal:
 472
 473 \begin{verbatim}
 474     rule term: ... | ID {{ return global_vars[ID] }}
 475 \end{verbatim}
 476
 477 So far, these changes are straightforward.  We simply have a global
 478 dictionary \texttt{global\_vars} that stores the variables and values,
 479 we modify it when there is an assignment statement, and we look up
 480 variables in it when we see a variable name.
 481
 482 To support local variables, we will add variable declarations to the
 483 set of allowed expressions.
 484
 485 \begin{verbatim}
 486     rule term: ... | 'let' VAR '=' expr 'in' expr ...
 487 \end{verbatim}
 488
 489 This is where it becomes tricky.  Local variables should be stored in
 490 a local dictionary, not in the global one.  One trick would be to save
 491 a copy of the global dictionary, modify it, and then restore it
 492 later.  In this example we will instead use \emph{attributes} to
 493 create local information and pass it to subrules.
 494
 495 A rule can optionally take parameters.  When we invoke the rule, we
 496 must pass in arguments.  For local variables, let's use a single
 497 parameter, \texttt{local\_vars}:
 498
 499 \begin{verbatim}
 500     rule expr<<local_vars>>:   ...
 501     rule factor<<local_vars>>: ...
 502     rule term<<local_vars>>:   ...
 503 \end{verbatim}
 504
 505 Each time we want to match \texttt{expr}, \texttt{factor}, or
 506 \texttt{term}, we will pass the local variables in the current rule to
 507 the subrule.  One interesting case is when we pass as an argument
 508 something \emph{other} than \texttt{local\_vars}:
 509
 510 \begin{verbatim}
 511    rule term<<local_vars>>: ...
 512                 | 'let' VAR '=' expr<<local_vars>>
 513                   {{ local_vars = [(VAR, expr)] + local_vars }}
 514                   'in' expr<<local_vars>>
 515                   {{ return expr }}
 516 \end{verbatim}
 517
 518 Note that the assignment to the local variables list does not modify
 519 the original list.  This is important to keep local variables from
 520 being seen outside the ``let''.
 521
 522 The other interesting case is when we find a variable:
 523
 524 \begin{verbatim}
 525 global_vars = {}
 526
 527 def lookup(map, name):
 528     for x,v in map:  if x==name: return v
 529     return global_vars[name]
 530 %%
 531    ...
 532    rule term<<local_vars>: ...
 533                 | VAR {{ return lookup(local_vars, VAR) }}
 534 \end{verbatim}
 535
 536 The lookup function will search through the local variable list, and
 537 if it cannot find the name there, it will look it up in the global
 538 variable dictionary.
 539
 540 A complete grammar for this example, including a read-eval-print loop
 541 for interacting with the calculator, can be found in the examples
 542 subdirectory included with Yapps.
 543
 544 In this section we saw how to insert code before the parser.  We also
 545 saw how to use attributes to transmit local information from one rule
 546 to its subrules.
 547
 548 \mysection{Grammars}
 549
 550 Each Yapps grammar has a name, a list of tokens, and a set of
 551 production rules.  A grammar named \texttt{X} will be used to produce
 552 a parser named \texttt{X} and a scanner anmed \texttt{XScanner}.  As
 553 in Python, names are case sensitive, start with a letter, and contain
 554 letters, numbers, and underscores (\_).
 555
 556 There are three kinds of tokens in Yapps: named, inline, and ignored.
 557 As their name implies, named tokens are given a name, using the token
 558 construct: \texttt{token \emph{name} : \emph{regexp}}.  In a rule, the
 559 token can be matched by using the name.  Inline tokens are regular
 560 expressions that are used in rules without being declared.  Ignored
 561 tokens are declared using the ignore construct: \texttt{ignore:
 562   \emph{regexp}}.  These tokens are ignored by the scanner, and are
 563 not seen by the parser.  Often whitespace is an ignored token.  The
 564 regular expressions used to define tokens should use the syntax
 565 defined in the \texttt{re} module, so some symbols may have to be
 566 backslashed.
 567
 568 Production rules in Yapps have a name and a pattern to match.  If the
 569 rule is parameterized, the name should be followed by a list of
 570 parameter names in \verb|<<...>>|.  A pattern can be a simple pattern
 571 or a compound pattern.  Simple patterns are the name of a named token,
 572 a regular expression in quotes (inline token), the name of a
 573 production rule (followed by arguments in \verb|<<...>>|, if the rule
 574 has parameters), and single line Python statements (\verb|{{...}}|).
 575 Compound patterns are sequences (\verb|A B C ...|), choices (
 576 \verb:A | B | C | ...:), options (\verb|[...]|), zero-or-more repetitions
 577 (\verb|...*|), and one-or-more repetitions (\verb|...+|).  Like
 578 regular expressions, repetition operators have a higher precedence
 579 than sequences, and sequences have a higher precedence than choices.
 580
 581 Whenever \verb|{{...}}| is used, a legal one-line Python statement
 582 should be put inside the braces.  The token \verb|}}| should not
 583 appear within the \verb|{{...}}| section, even within a string, since
 584 Yapps does not attempt to parse the Python statement.  A workaround
 585 for strings is to put two strings together (\verb|"}" "}"|), or to use
 586 backslashes (\verb|"}\}"|).  At the end of a rule you should use a
 587 \verb|{{ return X }}| statement to return a value.  However, you
 588 should \emph{not} use any control statements (\texttt{return},
 589 \texttt{continue}, \texttt{break}) in the middle of a rule.  Yapps
 590 needs to make assumptions about the control flow to generate a parser,
 591 and any changes to the control flow will confuse Yapps.
 592
 593 The \verb|<<...>>| form can occur in two places: to define parameters
 594 to a rule and to give arguments when matching a rule.  Parameters use
 595 the syntax used for Python functions, so they can include default
 596 arguments and the special forms (\verb|*args| and \verb|**kwargs|).
 597 Arguments use the syntax for Python function call arguments, so they
 598 can include normal arguments and keyword arguments.  The token
 599 \verb|>>| should not appear within the \verb|<<...>>| section.
 600
 601 In both the statements and rule arguments, you can use names defined
 602 by the parser to refer to matched patterns.  You can refer to the text
 603 matched by a named token by using the token name.  You can use the
 604 value returned by a production rule by using the name of that rule.
 605 If a name \texttt{X} is matched more than once (such as in loops), you
 606 will have to save the earlier value(s) in a temporary variable, and
 607 then use that temporary variable in the return value.  The next
 608 section has an example of a name that occurs more than once.
 609
 610 \mysubsection{Left Factoring}
 611 \label{sec:Left-Factoring}
 612
 613 Yapps produces ELL(1) parsers, which determine which clause to match
 614 based on the first token available.  Sometimes the leftmost tokens of
 615 several clauses may be the same.  The classic example is the
 616 \emph{if/then/else} construct in Pascal:
 617
 618 \begin{verbatim}
 619 rule stmt:  "if" expr "then" stmt {{ then_part = stmt }}
 620                       "else" stmt {{ return ('If',expr,then_part,stmt) }}
 621           | "if" expr "then" stmt {{ return ('If',expr,stmt,[]) }}
 622 \end{verbatim}
 623
 624 (Note that we have to save the first \texttt{stmt} into a variable
 625 because there is another \texttt{stmt} that will be matched.)  The
 626 left portions of the two clauses are the same, which presents a
 627 problem for the parser.  The solution is \emph{left-factoring}: the
 628 common parts are put together, and \emph{then} a choice is made about
 629 the remaining part:
 630
 631 \begin{verbatim}
 632 rule stmt:  "if" expr
 633               "then" stmt {{ then_part = stmt }}
 634               {{ else_part = [] }}
 635               [ "else" stmt {{ else_part = stmt }} ]
 636             {{ return ('If', expr, then_part, else_part) }}
 637 \end{verbatim}
 638
 639 Unfortunately, the classic \emph{if/then/else} situation is
 640 \emph{still} ambiguous when you left-factor.  Yapps can deal with this
 641 situation, but will report a warning; see section
 642 \ref{sec:Ambiguous-Grammars} for details.
 643
 644 In general, replace rules of the form:
 645
 646 \begin{verbatim}
 647 rule A:   a b1 {{ return E1 }}
 648         | a b2 {{ return E2 }}
 649         | c3   {{ return E3 }}
 650         | c4   {{ return E4 }}
 651 \end{verbatim}
 652
 653 with rules of the form:
 654
 655 \begin{verbatim}
 656 rule A:   a ( b1 {{ return E1 }}
 657             | b2 {{ return E2 }}
 658             )
 659         | c3   {{ return E3 }}
 660         | c4   {{ return E4 }}
 661 \end{verbatim}
 662
 663 \mysubsection{Left Recursion}
 664
 665 A common construct in grammars is for matching a list of patterns,
 666 sometimes separated with delimiters such as commas or semicolons.  In
 667 LR-based parser systems, we can parse a list with something like this:
 668
 669 \begin{verbatim}
 670 rule sum:  NUM             {{ return NUM }}
 671          | sum "+" NUM     {{ return (sum, NUM) }}
 672 \end{verbatim}
 673
 674 Parsing \texttt{1+2+3+4} would produce the output
 675 \texttt{(((1,2),3),4)}, which is what we want from a left-associative
 676 addition operator.  Unfortunately, this grammar is \emph{left
 677 recursive,} because the \texttt{sum} rule contains a clause that
 678 begins with \texttt{sum}.  (The recursion occurs at the left side of
 679 the clause.)
 680
 681 We must restructure this grammar to be \emph{right recursive} instead:
 682
 683 \begin{verbatim}
 684 rule sum:  NUM             {{ return NUM }}
 685          | NUM "+" sum     {{ return (NUM, sum) }}
 686 \end{verbatim}
 687
 688 Unfortunately, using this grammar, \texttt{1+2+3+4} would be parsed as
 689 \texttt{(1,(2,(3,4)))}, which no longer follows left associativity.
 690 The rule also needs to be left-factored.  Instead, we write the
 691 pattern as a loop instead:
 692
 693 \begin{verbatim}
 694 rule sum:       NUM {{ v = NUM }}
 695                 ( "[+]" NUM {{ v = (v,NUM) }} )*
 696                 {{ return v }}
 697 \end{verbatim}
 698
 699 In general, replace rules of the form:
 700
 701 \begin{verbatim}
 702 rule A:  A a1 -> << E1 >>
 703        | A a2 -> << E2 >>
 704        | b3   -> << E3 >>
 705        | b4   -> << E4 >>
 706 \end{verbatim}
 707
 708 with rules of the form:
 709
 710 \begin{verbatim}
 711 rule A:  ( b3 {{ A = E3 }}
 712          | b4 {{ A = E4 }} )
 713          ( a1 {{ A = E1 }}
 714          | a2 {{ A = E2 }} )*
 715          {{ return A }}
 716 \end{verbatim}
 717
 718 We have taken a rule that proved problematic for with recursion and
 719 turned it into a rule that works well with looping constructs.
 720
 721 \mysubsection{Ambiguous Grammars}
 722 \label{sec:Ambiguous-Grammars}
 723
 724 In section \ref{sec:Left-Factoring} we saw the classic if/then/else
 725 ambiguity, which occurs because the ``else \ldots'' portion of an ``if
 726 \ldots then \ldots else \ldots'' construct is optional.  Programs with
 727 nested if/then/else constructs can be ambiguous when one of the else
 728 clauses is missing:
 729 \begin{verbatim}
 730 if 1 then            if 1 then
 731     if 5 then            if 5 then
 732         x := 1;              x := 1;
 733     else             else
 734         y := 9;          y := 9;
 735 \end{verbatim}
 736
 737 The indentation shows that the program can be parsed in two different
 738 ways.  (Of course, if we all would adopt Python's indentation-based
 739 structuring, this would never happen!)  Usually we want the parsing on
 740 the left: the ``else'' should be associated with the closest ``if''
 741 statement.  In section \ref{sec:Left-Factoring} we ``solved'' the
 742 problem by using the following grammar:
 743
 744 \begin{verbatim}
 745 rule stmt:  "if" expr
 746               "then" stmt {{ then_part = stmt }}
 747               {{ else_part = [] }}
 748               [ "else" stmt {{ else_part = stmt }} ]
 749             {{ return ('If', expr, then_part, else_part) }}
 750 \end{verbatim}
 751
 752 Here, we have an optional match of ``else'' followed by a statement.
 753 The ambiguity is that if an ``else'' is present, it is not clear
 754 whether you want it parsed immediately or if you want it to be parsed
 755 by the outer ``if''.
 756
 757 Yapps will deal with the situation by matching when the else pattern
 758 when it can.  The parser will work in this case because it prefers the
 759 \emph{first} matching clause, which tells Yapps to parse the ``else''.
 760 That is exactly what we want!
 761
 762 For ambiguity cases with choices, Yapps will choose the \emph{first}
 763 matching choice.  However, remember that Yapps only looks at the first
 764 token to determine its decision, so {\tt (a b | a c)} will result in
 765 Yapps choosing {\tt a b} even when the input is {\tt a c}.  It only
 766 looks at the first token, {\tt a}, to make its decision.
 767
 768 \mysection{Customization}
 769
 770 Both the parsers and the scanners can be customized.  The parser is
 771 usually extended by subclassing, and the scanner can either be
 772 subclassed or completely replaced.
 773
 774 \mysubsection{Customizing Parsers}
 775
 776 If additional fields and methods are needed in order for a parser to
 777 work, Python subclassing can be used.  (This is unlike parser classes
 778 written in static languages, in which these fields and methods must be
 779 defined in the generated parser class.)  We simply subclass the
 780 generated parser, and add any fields or methods required.  Expressions
 781 in the grammar can call methods of the subclass to perform any actions
 782 that cannot be expressed as a simple expression.  For example,
 783 consider this simple grammar:
 784
 785 \begin{verbatim}
 786 parser X:
 787     rule goal:  "something"  {{ self.printmsg() }}
 788 \end{verbatim}
 789
 790 The \texttt{printmsg} function need not be implemented in the parser
 791 class \texttt{X}; it can be implemented in a subclass:
 792
 793 \begin{verbatim}
 794 import Xparser
 795
 796 class MyX(Xparser.X):
 797     def printmsg(self):
 798         print "Hello!"
 799 \end{verbatim}
 800
 801 \mysubsection{Customizing Scanners}
 802
 803 The generated parser class is not dependent on the generated scanner
 804 class.  A scanner object is passed to the parser object's constructor
 805 in the \texttt{parse} function.  To use a different scanner, write
 806 your own function to construct parser objects, with an instance of a
 807 different scanner.  Scanner objects must have a \texttt{token} method
 808 that accepts an integer \texttt{N} as well as a list of allowed token
 809 types, and returns the Nth token, as a tuple.  The default scanner
 810 raises \texttt{NoMoreTokens} if no tokens are available, and
 811 \texttt{SyntaxError} if no token could be matched.  However, the
 812 parser does not rely on these exceptions; only the \texttt{parse}
 813 convenience function (which calls \texttt{wrap\_error\_reporter}) and
 814 the \texttt{print\_error} error display function use those exceptions.
 815
 816 The tuples representing tokens have four elements.  The first two are
 817 the beginning and ending indices of the matched text in the input
 818 string.  The third element is the type tag, matching either the name
 819 of a named token or the quoted regexp of an inline or ignored token.
 820 The fourth element of the token tuple is the matched text.  If the
 821 input string is \texttt{s}, and the token tuple is
 822 \texttt{(b,e,type,val)}, then \texttt{val} should be equal to
 823 \texttt{s[b:e]}.
 824
 825 The generated parsers do not the beginning or ending index.  They use
 826 only the token type and value.  However, the default error reporter
 827 uses the beginning and ending index to show the user where the error
 828 is.
 829
 830 \mysection{Parser Mechanics}
 831
 832 The base parser class (Parser) defines two methods, \texttt{\_scan}
 833 and \texttt{\_peek}, and two fields, \texttt{\_pos} and
 834 \texttt{\_scanner}.  The generated parser inherits from the base
 835 parser, and contains one method for each rule in the grammar.  To
 836 avoid name clashes, do not use names that begin with an underscore
 837 (\texttt{\_}).
 838
 839 \mysubsection{Parser Objects}
 840 \label{sec:Parser-Objects}
 841
 842 Yapps produces as output two exception classes, a scanner class, a
 843 parser class, and a function \texttt{parse} that puts everything
 844 together.  The \texttt{parse} function does not have to be used;
 845 instead, one can create a parser and scanner object and use them
 846 together for parsing.
 847
 848 \begin{verbatim}
 849     def parse(rule, text):
 850         P = X(XScanner(text))
 851         return wrap_error_reporter(P, rule)
 852 \end{verbatim}
 853
 854 The \texttt{parse} function takes a name of a rule and an input string
 855 as input.  It creates a scanner and parser object, then calls
 856 \texttt{wrap\_error\_reporter} to execute the method in the parser
 857 object named \texttt{rule}.  The wrapper function will call the
 858 appropriate parser rule and report any parsing errors to standard
 859 output.
 860
 861 There are several situations in which the \texttt{parse} function
 862 would not be useful.  If a different parser or scanner is being used,
 863 or exceptions are to be handled differently, a new \texttt{parse}
 864 function would be required.  The supplied \texttt{parse} function can
 865 be used as a template for writing a function for your own needs.  An
 866 example of a custom parse function is the \texttt{generate} function
 867 in \texttt{Yapps.py}.
 868
 869 \mysubsection{Context Sensitive Scanner}
 870
 871 Unlike most scanners, the scanner produced by Yapps can take into
 872 account the context in which tokens are needed, and try to match only
 873 good tokens.  For example, in the grammar:
 874
 875 \begin{verbatim}
 876 parser IniFile:
 877    token ID:   "[a-zA-Z_0-9]+"
 878    token VAL:  ".*"
 879
 880    rule pair:  ID "[ \t]*=[ \t]*" VAL "\n"
 881 \end{verbatim}
 882
 883 we would like to scan lines of text and pick out a name/value pair.
 884 In a conventional scanner, the input string \texttt{shell=progman.exe}
 885 would be turned into a single token of type \texttt{VAL}.  The Yapps
 886 scanner, however, knows that at the beginning of the line, an
 887 \texttt{ID} is expected, so it will return \texttt{"shell"} as a token
 888 of type \texttt{ID}.  Later, it will return \texttt{"progman.exe"} as
 889 a token of type \texttt{VAL}.
 890
 891 Context sensitivity decreases the separation between scanner and
 892 parser, but it is useful in parsers like \texttt{IniFile}, where the
 893 tokens themselves are not unambiguous, but \emph{are} unambiguous
 894 given a particular stage in the parsing process.
 895
 896 Unfortunately, context sensitivity can make it more difficult to
 897 detect errors in the input.  For example, in parsing a Pascal-like
 898 language with ``begin'' and ``end'' as keywords, a context sensitive
 899 scanner would only match ``end'' as the END token if the parser is in
 900 a place that will accept the END token.  If not, then the scanner
 901 would match ``end'' as an identifier.  To disable the context
 902 sensitive scanner in Yapps, add the
 903 \texttt{context-insensitive-scanner} option to the grammar:
 904
 905 \begin{verbatim}
 906 Parser X:
 907     option:  "context-insensitive-scanner"
 908 \end{verbatim}
 909
 910 Context-insensitive scanning makes the parser look cleaner as well.
 911
 912 \mysubsection{Internal Variables}
 913
 914 There are two internal fields that may be of use.  The parser object
 915 has two fields, \texttt{\_pos}, which is the index of the current
 916 token being matched, and \texttt{\_scanner}, which is the scanner
 917 object.  The token itself can be retrieved by accessing the scanner
 918 object and calling the \texttt{token} method with the token index.  However, if you call \texttt{token} before the token has been requested by the parser, it may mess up a context-sensitive scanner.\footnote{When using a context-sensitive scanner, the parser tells the scanner what the valid token types are at each point.  If you call \texttt{token} before the parser can tell the scanner the valid token types, the scanner will attempt to match without considering the context.}  A
 919 potentially useful combination of these fields is to extract the
 920 portion of the input matched by the current rule.  To do this, just save the scanner state (\texttt{\_scanner.pos}) before the text is matched and then again after the text is matched:
 921
 922 \begin{verbatim}
 923   rule R:
 924       {{ start = self._scanner.pos }}
 925       a b c
 926       {{ end = self._scanner.pos }}
 927       {{ print 'Text is', self._scanner.input[start:end] }}
 928 \end{verbatim}
 929
 930 \mysubsection{Pre- and Post-Parser Code}
 931
 932 Sometimes the parser code needs to rely on helper variables,
 933 functions, and classes.  A Yapps grammar can optionally be surrounded
 934 by double percent signs, to separate the grammar from Python code.
 935
 936 \begin{verbatim}
 937 ... Python code ...
 938 %%
 939 ... Yapps grammar ...
 940 %%
 941 ... Python code ...
 942 \end{verbatim}
 943
 944 The second \verb|%%| can be omitted if there is no Python code at the
 945 end, and the first \verb|%%| can be omitted if there is no extra
 946 Python code at all.  (To have code only at the end, both separators
 947 are required.)
 948
 949 If the second \verb|%%| is omitted, Yapps will insert testing code
 950 that allows you to use the generated parser to parse a file.
 951
 952 The extended calculator example in the Yapps examples subdirectory
 953 includes both pre-parser and post-parser code.
 954
 955 \mysubsection{Representation of Grammars}
 956
 957 For each kind of pattern there is a class derived from Pattern.  Yapps
 958 has classes for Terminal, NonTerminal, Sequence, Choice, Option, Plus,
 959 Star, and Eval.  Each of these classes has the following interface:
 960
 961 \begin{itemize}
 962  \item[setup(\emph{gen})] Set accepts-$\epsilon$, and call
 963    \emph{gen.changed()} if it changed.  This function can change the
 964    flag from false to true but \emph{not} from true to false.
 965  \item[update(\emph(gen))] Set \first and \follow, and call
 966    \emph{gen.changed()} if either changed.  This function can add to
 967    the sets but \emph{not} remove from them.
 968  \item[output(\emph{gen}, \emph{indent})] Generate code for matching
 969    this rule, using \emph{indent} as the current indentation level.
 970    Writes are performed using \emph{gen.write}.
 971  \item[used(\emph{vars})] Given a list of variables \emph{vars},
 972    return two lists: one containing the variables that are used, and
 973    one containing the variables that are assigned.  This function is
 974    used for optimizing the resulting code.
 975 \end{itemize}
 976
 977 Both \emph{setup} and \emph{update} monotonically increase the
 978 variables they modify.  Since the variables can only increase a finite
 979 number of times, we can repeatedly call the function until the
 980 variable stabilized.  The \emph{used} function is not currently
 981 implemented.
 982
 983 With each pattern in the grammar Yapps associates three pieces of
 984 information: the \first set, the \follow set, and the
 985 accepts-$\epsilon$ flag.
 986
 987 The \first set contains the tokens that can appear as we start
 988 matching the pattern.  The \follow set contains the tokens that can
 989 appear immediately after we match the pattern.  The accepts-$\epsilon$
 990 flag is true if the pattern can match no tokens.  In this case, \first
 991 will contain all the elements in \follow.  The \follow set is not
 992 needed when accepts-$\epsilon$ is false, and may not be accurate in
 993 those cases.
 994
 995 Yapps does not compute these sets precisely.  Its approximation can
 996 miss certain cases, such as this one:
 997
 998 \begin{verbatim}
 999   rule C: ( A* | B )
1000   rule B: C [A]
1001 \end{verbatim}
1002
1003 Yapps will calculate {\tt C}'s \follow set to include {\tt A}.
1004 However, {\tt C} will always match all the {\tt A}'s, so {\tt A} will
1005 never follow it.  Yapps 2.0 does not properly handle this construct,
1006 but if it seems important, I may add support for it in a future
1007 version.
1008
1009 Yapps also cannot handle constructs that depend on the calling
1010 sequence.  For example:
1011
1012 \begin{verbatim}
1013   rule R: U | 'b'
1014   rule S:   | 'c'
1015   rule T: S 'b'
1016   rule U: S 'a'
1017 \end{verbatim}
1018
1019 The \follow set for {\tt S} includes {\tt a} and {\tt b}.  Since {\tt
1020   S} can be empty, the \first set for {\tt S} should include {\tt a},
1021 {\tt b}, and {\tt c}.  However, when parsing {\tt R}, if the lookahead
1022 is {\tt b} we should \emph{not} parse {\tt U}.  That's because in {\tt
1023   U}, {\tt S} is followed by {\tt a} and not {\tt b}.  Therefore in
1024 {\tt R}, we should choose rule {\tt U} only if there is an {\tt a} or
1025 {\tt c}, but not if there is a {\tt b}.  Yapps and many other LL(1)
1026 systems do not distinguish {\tt S b} and {\tt S a}, making {\tt
1027   S}'s \follow set {\tt a, b}, and making {\tt R} always try to match
1028 {\tt U}.  In this case we can solve the problem by changing {\tt R} to
1029 \verb:'b' | U: but it may not always be possible to solve all such
1030 problems in this way.
1031
1032 \appendix
1033
1034 \mysection{Grammar for Parsers}
1035
1036 This is the grammar for parsers, without any Python code mixed in.
1037 The complete grammar can be found in \texttt{parsedesc.g} in the Yapps
1038 distribution.
1039
1040 \begin{verbatim}
1041 parser ParserDescription:
1042     ignore:      "\\s+"
1043     ignore:      "#.*?\r?\n"
1044     token END:   "$"  # $ means end of string
1045     token ATTR:  "<<.+?>>"
1046     token STMT:  "{{.+?}}"
1047     token ID:    '[a-zA-Z_][a-zA-Z_0-9]*'
1048     token STR:   '[rR]?\'([^\\n\'\\\\]|\\\\.)*\'|[rR]?"([^\\n"\\\\]|\\\\.)*"'
1049
1050     rule Parser: "parser" ID ":"
1051                    Options
1052                    Tokens
1053                    Rules
1054                  END
1055
1056     rule Options:  ( "option" ":" STR )*
1057     rule Tokens:   ( "token" ID ":" STR | "ignore"   ":" STR )*
1058     rule Rules:    ( "rule" ID OptParam ":" ClauseA )*
1059
1060     rule ClauseA:  ClauseB ( '[|]' ClauseB )*
1061     rule ClauseB:  ClauseC*
1062     rule ClauseC:  ClauseD [ '[+]' | '[*]' ]
1063     rule ClauseD:  STR | ID [ATTR] | STMT
1064                  | '\\(' ClauseA '\\) | '\\[' ClauseA '\\]'
1065 \end{verbatim}
1066
1067 \mysection{Upgrading}
1068
1069 Yapps 2.0 is not backwards compatible with Yapps 1.0.  In this section
1070 are some tips for upgrading:
1071
1072 \begin{enumerate}
1073  \item Yapps 1.0 was distributed as a single file.  Yapps 2.0 is
1074    instead distributed as two Python files: a \emph{parser generator}
1075    (26k) and a \emph{parser runtime} (5k).  You need both files to
1076    create parsers, but you need only the runtime (\texttt{yappsrt.py})
1077    to use the parsers.
1078
1079  \item Yapps 1.0 supported Python 1.4 regular expressions from the
1080    \texttt{regex} module.  Yapps 2.0 uses Python 1.5 regular
1081    expressions from the \texttt{re} module.  \emph{The new syntax for
1082      regular expressions is not compatible with the old syntax.}
1083    Andrew Kuchling has a \htmladdnormallink{guide to converting
1084      regular
1085      expressions}{http://www.python.org/doc/howto/regex-to-re/} on his
1086    web page.
1087
1088  \item Yapps 1.0 wants a pattern and then a return value in \verb|->|
1089    \verb|<<...>>|.  Yapps 2.0 allows patterns and Python statements to
1090    be mixed.  To convert a rule like this:
1091
1092 \begin{verbatim}
1093 rule R: A B C -> << E1 >>
1094       | X Y Z -> << E2 >>
1095 \end{verbatim}
1096
1097    to Yapps 2.0 form, replace the return value specifiers with return
1098    statements:
1099
1100 \begin{verbatim}
1101 rule R: A B C {{ return E1 }}
1102       | X Y Z {{ return E2 }}
1103 \end{verbatim}
1104
1105  \item Yapps 2.0 does not perform tail recursion elimination.  This
1106    means any recursive rules you write will be turned into recursive
1107    methods in the parser.  The parser will work, but may be slower.
1108    It can be made faster by rewriting recursive rules, using instead
1109    the looping operators \verb|*| and \verb|+| provided in Yapps 2.0.
1110
1111 \end{enumerate}
1112
1113 \mysection{Troubleshooting}
1114
1115 \begin{itemize}
1116  \item A common error is to write a grammar that doesn't have an END
1117    token.  End tokens are needed when it is not clear when to stop
1118    parsing.  For example, when parsing the expression {\tt 3+5}, it is
1119    not clear after reading {\tt 3} whether to treat it as a complete
1120    expression or whether the parser should continue reading.
1121    Therefore the grammar for numeric expressions should include an end
1122    token.  Another example is the grammar for Lisp expressions.  In
1123    Lisp, it is always clear when you should stop parsing, so you do
1124    \emph{not} need an end token.  In fact, it may be more useful not
1125    to have an end token, so that you can read in several Lisp expressions.
1126  \item If there is a chance of ambiguity, make sure to put the choices
1127    in the order you want them checked.  Usually the most specific
1128    choice should be first.  Empty sequences should usually be last.
1129  \item The context sensitive scanner is not appropriate for all
1130    grammars.  You might try using the insensitive scanner with the
1131    {\tt context-insensitive-scanner} option in the grammar.
1132  \item If performance turns out to be a problem, try writing a custom
1133    scanner.  The Yapps scanner is rather slow (but flexible and easy
1134    to understand).
1135 \end{itemize}
1136
1137 \mysection{History}
1138
1139 Yapps 1 had several limitations that bothered me while writing
1140 parsers:
1141
1142 \begin{enumerate}
1143  \item It was not possible to insert statements into the generated
1144    parser.  A common workaround was to write an auxilliary function
1145    that executed those statements, and to call that function as part
1146    of the return value calculation.  For example, several of my
1147    parsers had an ``append(x,y)'' function that existed solely to call
1148    ``x.append(y)''.
1149  \item The way in which grammars were specified was rather
1150    restrictive: a rule was a choice of clauses.  Each clause was a
1151    sequence of tokens and rule names, followed by a return value.
1152  \item Optional matching had to be put into a separate rule because
1153    choices were only made at the beginning of a rule.
1154  \item Repetition had to be specified in terms of recursion.  Not only
1155    was this awkward (sometimes requiring additional rules), I had to
1156    add a tail recursion optimization to Yapps to transform the
1157    recursion back into a loop.
1158 \end{enumerate}
1159
1160 Yapps 2 addresses each of these limitations.
1161
1162 \begin{enumerate}
1163  \item Statements can occur anywhere within a rule.  (However, only
1164    one-line statements are allowed; multiline blocks marked by
1165    indentation are not.)
1166  \item Grammars can be specified using any mix of sequences, choices,
1167    tokens, and rule names.  To allow for complex structures,
1168    parentheses can be used for grouping.
1169  \item Given choices and parenthesization, optional matching can be
1170    expressed as a choice between some pattern and nothing.  In
1171    addition, Yapps 2 has the convenience syntax \verb|[A B ...]| for
1172    matching \verb|A B ...| optionally.
1173  \item Repetition operators \verb|*| for zero or more and \verb|+| for
1174    one or more make it easy to specify repeating patterns.
1175 \end{enumerate}
1176
1177 It is my hope that Yapps 2 will be flexible enough to meet my needs
1178 for another year, yet simple enough that I do not hesitate to use it.
1179
1180 \mysection{Future Extensions}
1181 \label{sec:future}
1182
1183 I am still investigating the possibility of LL(2) and higher
1184 lookahead.  However, it looks like the resulting parsers will be
1185 somewhat ugly.
1186
1187 It would be nice to control choices with user-defined predicates.
1188
1189 The most likely future extension is backtracking.  A grammar pattern
1190 like \verb|(VAR ':=' expr)? {{ return Assign(VAR,expr) }} : expr {{ return expr }}|
1191 would turn into code that attempted to match \verb|VAR ':=' expr|.  If
1192 it succeeded, it would run \verb|{{ return ... }}|.  If it failed, it
1193 would match \verb|expr {{ return expr }}|.  Backtracking may make it
1194 less necessary to write LL(2) grammars.
1195
1196 \mysection{References}
1197
1198 \begin{enumerate}
1199  \item The \htmladdnormallink{Python-Parser
1200      SIG}{http://www.python.org/sigs/parser-sig/} is the first place
1201    to look for a list of parser systems for Python.
1202
1203   \item ANTLR/PCCTS, by Terrence Parr, is available at
1204   \htmladdnormallink{The ANTLR Home Page}{http://www.antlr.org/}.
1205
1206   \item PyLR, by Scott Cotton, is at \htmladdnormallink{his Starship
1207     page}{http://starship.skyport.net/crew/scott/PyLR.html}.
1208
1209   \item John Aycock's \htmladdnormallink{Compiling Little Languages
1210       Framework}{http://www.csr.UVic.CA/~aycock/python/}.
1211
1212   \item PyBison, by Scott Hassan, can be found at
1213   \htmladdnormallink{his Python Projects
1214     page}{http://coho.stanford.edu/\~{}hassan/Python/}.
1215
1216   \item mcf.pars, by Mike C. Fletcher, is available at
1217   \htmladdnormallink{his web
1218     page}{http://www.golden.net/\~{}mcfletch/programming/}.
1219
1220   \item kwParsing, by Aaron Watters, is available at
1221   \htmladdnormallink{his Starship
1222     page}{http://starship.skyport.net/crew/aaron_watters/kwParsing/}.
1223 \end{enumerate}
1224
1225 \end{document}