| <html> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII"> |
| <title>Introduction to Spirit.Lex</title> |
| <link rel="stylesheet" href="../../../../../../doc/src/boostbook.css" type="text/css"> |
| <meta name="generator" content="DocBook XSL Stylesheets V1.75.0"> |
| <link rel="home" href="../../index.html" title="Spirit 2.4.1"> |
| <link rel="up" href="../lex.html" title="Lex - Writing Lexical Analyzers"> |
| <link rel="prev" href="../lex.html" title="Lex - Writing Lexical Analyzers"> |
| <link rel="next" href="tutorials.html" title="Spirit.Lex Tutorials"> |
| </head> |
| <body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"> |
| <table cellpadding="2" width="100%"><tr> |
| <td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../../../../boost.png"></td> |
| <td align="center"><a href="../../../../../../index.html">Home</a></td> |
| <td align="center"><a href="../../../../../../libs/libraries.htm">Libraries</a></td> |
| <td align="center"><a href="http://www.boost.org/users/people.html">People</a></td> |
| <td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td> |
| <td align="center"><a href="../../../../../../more/index.htm">More</a></td> |
| </tr></table> |
| <hr> |
| <div class="spirit-nav"> |
| <a accesskey="p" href="../lex.html"><img src="../../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../lex.html"><img src="../../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../../index.html"><img src="../../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="tutorials.html"><img src="../../../../../../doc/src/images/next.png" alt="Next"></a> |
| </div> |
| <div class="section"> |
| <div class="titlepage"><div><div><h3 class="title"> |
| <a name="spirit.lex.lexer_introduction"></a><a class="link" href="lexer_introduction.html" title="Introduction to Spirit.Lex">Introduction to <span class="emphasis"><em>Spirit.Lex</em></span></a> |
| </h3></div></div></div> |
| <p> |
| Lexical scanning is the process of analyzing the stream of input characters |
| and separating it into strings called tokens, separated by whitespace. Most |
| compiler texts start here, and devote several chapters to discussing various |
| ways to build scanners. <span class="emphasis"><em>Spirit.Lex</em></span> is a library built |
| to take care of the complexities of creating a lexer for your grammar (in |
| this documentation we will use the terms 'lexical analyzer', 'lexer' and |
| 'scanner' interchangeably). All that is needed to create a lexer is to know |
| the set of patterns describing the different tokens you want to recognize |
| in the input. To make this a bit more formal, here are some definitions: |
| </p> |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"> |
| <li class="listitem"> |
| A token is a sequence of consecutive characters having a collective meaning. |
| Tokens may have attributes specific to the token type, carrying additional |
| information about the matched character sequence. |
| </li> |
| <li class="listitem"> |
| A pattern is a rule expressed as a regular expression and describing |
| how a particular token can be formed. For example, <code class="literal">[A-Za-z][A-Za-z_0-9]*</code> |
| is a pattern for a rule matching C++ identifiers. |
| </li> |
| <li class="listitem"> |
| Characters between tokens are called whitespace; these include spaces, |
| tabs, newlines, and formfeeds. Many people also count comments as whitespace, |
| though since some tools such as lint look at comments, this method is |
| not perfect. |
| </li> |
| </ul></div> |
| <a name="spirit.lex.lexer_introduction.why_use_a_separate_lexer_"></a><h5> |
| <a name="id1115090"></a> |
| <a class="link" href="lexer_introduction.html#spirit.lex.lexer_introduction.why_use_a_separate_lexer_">Why |
| Use a Separate Lexer?</a> |
| </h5> |
| <p> |
| Typically, lexical scanning is done in a separate module from the parser, |
| feeding the parser with a stream of input tokens only. Theoretically it is |
| not necessary implement this separation as in the end there is only one set |
| of syntactical rules defining the language, so in theory we could write the |
| whole parser in one module. In fact, <span class="emphasis"><em>Spirit.Qi</em></span> allows |
| you to write parsers without using a lexer, parsing the input character stream |
| directly, and for the most part this is the way <a href="http://boost-spirit.com" target="_top">Spirit</a> |
| has been used since its invention. |
| </p> |
| <p> |
| However, this separation has both practical and theoretical basis, and proves |
| to be very useful in practical applications. In 1956, Noam Chomsky defined |
| the "Chomsky Hierarchy" of grammars: |
| </p> |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"> |
| <li class="listitem"> |
| Type 0: Unrestricted grammars (e.g., natural languages) |
| </li> |
| <li class="listitem"> |
| Type 1: Context-Sensitive grammars |
| </li> |
| <li class="listitem"> |
| Type 2: Context-Free grammars |
| </li> |
| <li class="listitem"> |
| Type 3: Regular grammars |
| </li> |
| </ul></div> |
| <p> |
| The complexity of these grammars increases from regular grammars being the |
| simplest to unrestricted grammars being the most complex. Similarly, the |
| complexity of pattern recognition for these grammars increases. Although, |
| a few features of some programming languages (such as C++) are Type 1, fortunately |
| for the most part programming languages can be described using only the Types |
| 2 and 3. The neat part about these two types is that they are well known |
| and the ways to parse them are well understood. It has been shown that any |
| regular grammar can be parsed using a state machine (finite automaton). Similarly, |
| context-free grammars can always be parsed using a push-down automaton (essentially |
| a state machine augmented by a stack). |
| </p> |
| <p> |
| In real programming languages and practical grammars, the parts that can |
| be handled as regular expressions tend to be the lower-level pieces, such |
| as the definition of an identifier or of an integer value: |
| </p> |
| <pre class="programlisting"><span class="identifier">letter</span> <span class="special">:=</span> <span class="special">[</span><span class="identifier">a</span><span class="special">-</span><span class="identifier">zA</span><span class="special">-</span><span class="identifier">Z</span><span class="special">]</span> |
| <span class="identifier">digit</span> <span class="special">:=</span> <span class="special">[</span><span class="number">0</span><span class="special">-</span><span class="number">9</span><span class="special">]</span> |
| |
| <span class="identifier">identifier</span> <span class="special">:=</span> <span class="identifier">letter</span> <span class="special">[</span> <span class="identifier">letter</span> <span class="special">|</span> <span class="identifier">digit</span> <span class="special">]*</span> |
| <span class="identifier">integer</span> <span class="special">:=</span> <span class="identifier">digit</span><span class="special">+</span> |
| </pre> |
| <p> |
| Higher level parts of practical grammars tend to be more complex and can't |
| be implemented using plain regular expressions. We need to store information |
| on the built-in hardware stack while recursing the grammar hierarchy, and |
| that is the preferred approach used for top-down parsing. Since it takes |
| a different kind of abstract machine to parse the two types of grammars, |
| it proved to be efficient to separate the lexical scanner into a separate |
| module which is built around the idea of a state machine. The goal here is |
| to use the simplest parsing technique needed for the job. |
| </p> |
| <p> |
| Another, more practical, reason for separating the scanner from the parser |
| is the need for backtracking during parsing. The input data is a stream of |
| characters, which is often thought to be processed left to right without |
| any backtracking. Unfortunately, in practice most of the time that isn't |
| possible. Almost every language has certain keywords such as IF, FOR, and |
| WHILE. The decision if a certain character sequence actually comprises a |
| keyword or just an identifier often can be made only after seeing the first |
| delimiter <span class="emphasis"><em>after</em></span> it. In fact, this makes the process |
| backtracking, since we need to store the string long enough to be able to |
| make the decision. The same is true for more coarse grained language features |
| such as nested IF/ELSE statements, where the decision about to which IF belongs |
| the last ELSE statement can be made only after seeing the whole construct. |
| </p> |
| <p> |
| So the structure of a conventional compiler often involves splitting up the |
| functions of the lower-level and higher-level parsing. The lexical scanner |
| deals with things at the character level, collecting characters into strings, |
| converting character sequence into different representations as integers, |
| etc., and passing them along to the parser proper as indivisible tokens. |
| It's also considered normal to let the scanner do additional jobs, such as |
| identifying keywords, storing identifiers in tables, etc. |
| </p> |
| <p> |
| Now, <a href="http://boost-spirit.com" target="_top">Spirit</a> follows this structure, |
| where <span class="emphasis"><em>Spirit.Lex</em></span> can be used to implement state machine |
| based recognizers, while <span class="emphasis"><em>Spirit.Qi</em></span> can be used to build |
| recognizers for context free grammars. Since both modules are seamlessly |
| integrated with each other and with the C++ target language it is even possible |
| to use the provided functionality to build more complex grammar recognizers. |
| </p> |
| <a name="spirit.lex.lexer_introduction.advantages_of_using__emphasis_spirit_lex__emphasis_"></a><h5> |
| <a name="id1115314"></a> |
| <a class="link" href="lexer_introduction.html#spirit.lex.lexer_introduction.advantages_of_using__emphasis_spirit_lex__emphasis_">Advantages |
| of using <span class="emphasis"><em>Spirit.Lex</em></span></a> |
| </h5> |
| <p> |
| The advantage of using <span class="emphasis"><em>Spirit.Lex</em></span> to create the lexical |
| analyzer over using more traditional tools such as <a href="http://flex.sourceforge.net/" target="_top">Flex</a> |
| is its carefully crafted integration with the <a href="http://boost-spirit.com" target="_top">Spirit</a> |
| library and the C++ host language. You don't need any external tools to generate |
| the code, your lexer will be perfectly integrated with the rest of your program, |
| making it possible to freely access any context information and data structure. |
| Since the C++ compiler sees all the code, it will generate optimal code no |
| matter what configuration options have been chosen by the user. <span class="emphasis"><em>Spirit.Lex</em></span> |
| gives you the vast majority of features you could get from a similar <a href="http://flex.sourceforge.net/" target="_top">Flex</a> program without the need |
| to leave C++ as a host language: |
| </p> |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"> |
| <li class="listitem"> |
| The definition of tokens is done using regular expressions (patterns) |
| </li> |
| <li class="listitem"> |
| The token definitions can refer to special substitution strings (pattern |
| macros) simplifying pattern definitions |
| </li> |
| <li class="listitem"> |
| The generated lexical scanner may have multiple start states |
| </li> |
| <li class="listitem"> |
| It is possible to attach code to any of the token definitions; this code |
| gets executed whenever the corresponding token pattern has been matched |
| </li> |
| </ul></div> |
| <p> |
| Even if it is possible to use <span class="emphasis"><em>Spirit.Lex</em></span> to generate |
| C++ code representing the lexical analyzer (we will refer to that as the |
| <span class="emphasis"><em>static</em></span> model, described in more detail in the section |
| <a class="link" href="abstracts/lexer_static_model.html" title="The Static Lexer Model">The <span class="emphasis"><em>Static</em></span> |
| Model</a>) - a model very similar to the way <a href="http://flex.sourceforge.net/" target="_top">Flex</a> |
| operates - we will mainly focus on the opposite, the <span class="emphasis"><em>dynamic</em></span> |
| model. You can directly integrate the token definitions into your C++ program, |
| building the lexical analyzer dynamically at runtime. The dynamic model is |
| something not supported by <a href="http://flex.sourceforge.net/" target="_top">Flex</a> |
| or other lexical scanner generators (such as <a href="http://re2c.sourceforge.net/" target="_top">re2c</a>, |
| <a href="http://www.cs.queensu.ca/~thurston/ragel/" target="_top">Ragel</a>, etc.). |
| This dynamic flexibility allows you to speed up the development of your application. |
| </p> |
| <a name="spirit.lex.lexer_introduction.the_library_structure_of__emphasis_spirit_lex__emphasis_"></a><h5> |
| <a name="id1115432"></a> |
| <a class="link" href="lexer_introduction.html#spirit.lex.lexer_introduction.the_library_structure_of__emphasis_spirit_lex__emphasis_">The |
| Library Structure of <span class="emphasis"><em>Spirit.Lex</em></span></a> |
| </h5> |
| <p> |
| The <a class="link" href="lexer_introduction.html#spirit.lexerflow" title="Figure 6. The Library structure and Common Flow of Information while using Spirit.Lex in an application">figure</a> below shows a high level |
| overview of how the <span class="emphasis"><em>Spirit.Lex</em></span> library might be used |
| in an application. <span class="emphasis"><em>Spirit.Lex</em></span> allows to create lexical |
| analyzers based on patterns. These patterns are regular expression based |
| rules used to define the different tokens to be recognized in the character |
| input sequence. The input sequence is expected to be provided to the lexical |
| analyzer as an arbitrary standard forward iterator. The lexical analyzer |
| itself exposes a standard forward iterator as well. The difference here is |
| that the exposed iterator provides access to the token sequence instead of |
| to the character sequence. The tokens in this sequence are constructed on |
| the fly by analyzing the underlying character sequence and matching this |
| to the patterns as defined by the application. |
| </p> |
| <p> |
| </p> |
| <div class="figure"> |
| <a name="spirit.lexerflow"></a><p class="title"><b>Figure 6. The Library structure and Common Flow of Information while |
| using <span class="emphasis"><em>Spirit.Lex</em></span> in an application</b></p> |
| <div class="figure-contents"><span class="inlinemediaobject"><img src="../.././images/lexerflow.png" alt="The Library structure and Common Flow of Information while using Spirit.Lex in an application"></span></div> |
| </div> |
| <p><br class="figure-break"> |
| </p> |
| </div> |
| <table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr> |
| <td align="left"></td> |
| <td align="right"><div class="copyright-footer">Copyright © 2001-2010 Joel de Guzman, Hartmut Kaiser<p> |
| Distributed under the Boost Software License, Version 1.0. (See accompanying |
| file LICENSE_1_0.txt or copy at <a href="http://www.boost.org/LICENSE_1_0.txt" target="_top">http://www.boost.org/LICENSE_1_0.txt</a>) |
| </p> |
| </div></td> |
| </tr></table> |
| <hr> |
| <div class="spirit-nav"> |
| <a accesskey="p" href="../lex.html"><img src="../../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../lex.html"><img src="../../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../../index.html"><img src="../../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="tutorials.html"><img src="../../../../../../doc/src/images/next.png" alt="Next"></a> |
| </div> |
| </body> |
| </html> |