| [/============================================================================== |
| Copyright (C) 2001-2010 Joel de Guzman |
| Copyright (C) 2001-2010 Hartmut Kaiser |
| |
| Distributed under the Boost Software License, Version 1.0. (See accompanying |
| file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) |
| ===============================================================================/] |
| |
| [section Introduction] |
| |
| Boost Spirit is an object-oriented, recursive-descent parser and |
| output generation library for C++. It allows you to write grammars and |
| format descriptions using a format similar to Extended Backus Naur |
| Form (EBNF)[footnote [@http://www.cl.cam.ac.uk/%7Emgk25/iso-14977.pdf |
| ISO-EBNF]] directly in C++. These inline grammar |
| specifications can mix freely with other C++ code and, thanks to the |
| generative power of C++ templates, are immediately executable. In |
| retrospect, conventional compiler-compilers or parser-generators have |
| to perform an additional translation step from the source EBNF code to |
| C or C++ code. |
| |
| The syntax and semantics of the libraries' API directly form domain-specific |
| embedded languages (DSEL). In fact, Spirit exposes 3 different DSELs to the |
| user: |
| |
| * one for creating parser grammars, |
| * one for the specification of the required tokens to be used for parsing, |
| * and one for the description of the required output formats. |
| |
| Since the target input grammars and output formats are written entirely in C++ |
| we do not need any separate tools to compile, preprocess or integrate those |
| into the build process. __spirit__ allows seamless integration of the parsing |
| and output generation process with other C++ code. This often allows for |
| simpler and more efficient code. |
| |
| Both the created parsers and generators are fully attributed, which allows you |
| to easily build and handle hierarchical data structures in memory. These data |
| structures resemble the structure of the input data and can directly be used |
| to generate arbitrarily-formatted output. |
| |
| The [link spirit.spiritstructure figure] below depicts the overall structure |
| of the Boost Spirit library. The library consists of 4 major parts: |
| |
| * __classic__: This is the almost-unchanged code base taken from the |
| former Boost Spirit V1.8 distribution. It has been moved into the namespace |
| boost::spirit::classic. A special compatibility layer has been added to |
| ensure complete compatibility with existing code using Spirit V1.8. |
| * __qi__: This is the parser library allowing you to build recursive |
| descent parsers. The exposed domain-specific language can be used to describe |
| the grammars to implement, and the rules for storing the parsed information. |
| * __lex__: This is the library usable to create tokenizers (lexers). The |
| domain-specific language exposed by __lex__ allows you to define regular |
| expressions used to match tokens (create token definitions), associate these |
| regular expressions with code to be executed whenever they are matched, and |
| to add the token definitions to the lexical analyzer. |
| * __karma__: This is the generator library allowing you to create code for |
| recursive descent, data type-driven output formatting. The exposed |
| domain-specific language is almost equivalent to the parser description language |
| used in __qi__, except that it is used to describe the required output |
| format to generate from a given data structure. |
| |
| [fig spiritstructure.png..The overall structure of the Boost Spirit library..spirit.spiritstructure] |
| |
| |
| The three components, __qi__, __karma__ and __lex__, are designed to be used |
| either stand alone, or together. The general methodology is to use the token |
| sequence generated by __lex__ as the input for a parser generated by __qi__. |
| On the opposite side of the equation, the hierarchical data structures generated |
| by __qi__ are used for the output generators created using __karma__. |
| However, there is nothing to stop you from using any of these components all |
| by themselves. |
| |
| The [link spirit.spiritkarmaflow figure] below shows the typical data flow of |
| some input being converted to some internal representation. After some |
| (optional) transformation these data are converted back into some different, |
| external representation. The picture highlights Spirit's place in this data |
| transformation flow. |
| |
| [fig spiritkarmaflow.png..The place of __qi__ and __karma__ in a data transformation flow of a typical application..spirit.spiritkarmaflow] |
| |
| [heading A Quick Overview of Parsing with __qi__] |
| |
| __qi__ is Spirit's sublibrary dealing with generating parsers based on a given |
| target grammar (essentially a format description of the input data to read). |
| |
| A simple EBNF grammar snippet: |
| |
| group ::= '(' expression ')' |
| factor ::= integer | group |
| term ::= factor (('*' factor) | ('/' factor))* |
| expression ::= term (('+' term) | ('-' term))* |
| |
| is approximated using facilities of Spirit's /Qi/ sublibrary as seen in this |
| code snippet: |
| |
| group = '(' >> expression >> ')'; |
| factor = integer | group; |
| term = factor >> *(('*' >> factor) | ('/' >> factor)); |
| expression = term >> *(('+' >> term) | ('-' >> term)); |
| |
| Through the magic of expression templates, this is perfectly valid and |
| executable C++ code. The production rule `expression` is, in fact, an object that |
| has a member function `parse` that does the work given a source code written in |
| the grammar that we have just declared. Yes, it's a calculator. We shall |
| simplify for now by skipping the type declarations and the definition of the |
| rule `integer` invoked by `factor`. Now, the production rule `expression` in our |
| grammar specification, traditionally called the `start` symbol, can recognize |
| inputs such as: |
| |
| 12345 |
| -12345 |
| +12345 |
| 1 + 2 |
| 1 * 2 |
| 1/2 + 3/4 |
| 1 + 2 + 3 + 4 |
| 1 * 2 * 3 * 4 |
| (1 + 2) * (3 + 4) |
| (-1 + 2) * (3 + -4) |
| 1 + ((6 * 200) - 20) / 6 |
| (1 + (2 + (3 + (4 + 5)))) |
| |
| Certainly we have modified the original EBNF syntax. This is done to |
| conform to C++ syntax rules. Most notably we see the abundance of |
| shift >> operators. Since there are no 'empty' operators in C++, it is |
| simply not possible to write something like: |
| |
| a b |
| |
| as seen in math syntax, for example, to mean multiplication or, in our case, |
| as seen in EBNF syntax to mean sequencing (b should follow a). __qi__ |
| uses the shift `>>` operator instead for this purpose. We take the `>>` operator, |
| with arrows pointing to the right, to mean "is followed by". Thus we write: |
| |
| a >> b |
| |
| The alternative operator `|` and the parentheses `()` remain as is. The |
| assignment operator `=` is used in place of EBNF's `::=`. Last but not least, |
| the Kleene star `*`, which in this case is a postfix operator in EBNF becomes a |
| prefix. Instead of: |
| |
| a* //... in EBNF syntax, |
| |
| we write: |
| |
| *a //... in Spirit. |
| |
| since there are no postfix stars, `*`, in C/C++. Finally, we terminate each |
| rule with the ubiquitous semi-colon, `;`. |
| |
| |
| [heading A Quick Overview of Output Generation with __karma__] |
| |
| Spirit not only allows you to describe the structure of the input, it also enables |
| the specification of the output format for your data in a similar way, and based |
| on a single syntax and compatible semantics. |
| |
| Let's assume we need to generate a textual representation from a simple data |
| structure such as a `std::vector<int>`. Conventional code probably would look like: |
| |
| std::vector<int> v (initialize_and_fill()); |
| std::vector<int>::iterator end = v.end(); |
| for (std::vector<int>::iterator it = v.begin(); it != end; ++it) |
| std::cout << *it << std::endl; |
| |
| which is not very flexible and quite difficult to maintain when it comes to |
| changing the required output format. Spirit's sublibrary /Karma/ allows you to |
| specify output formats for arbitrary data structures in a very flexible way. |
| The following snippet is the /Karma/ format description used to create the |
| same output as the traditional code above: |
| |
| *(int_ << eol) |
| |
| Here are some more examples of format descriptions for different output |
| representations of the same `std::vector<int>`: |
| |
| [table Different output formats for `std::vector<int>` |
| [ [Format] [Example] [Description] ] |
| [ [`'[' << *(int_ << ',') << ']'`] [`[1,8,10,]`] [Comma separated list of integers] ] |
| [ [`*('(' << int_ << ')' << ',')`] [`(1),(8),(10),`] [Comma separated list of integers in parenthesis] ] |
| [ [`*hex`] [`18a`] [A list of hexadecimal numbers] ] |
| [ [`*(double_ << ',')`] [`1.0,8.0,10.0,`] [A list of floating point numbers] ] |
| ] |
| |
| We will see later in this documentation how it is possible to avoid printing |
| the trailing `','`. |
| |
| Overall, the syntax is similar to __qi__ with the exception that we use the `<<` |
| operator for output concatenation. This should be easy to understand as it |
| follows the conventions used in the Standard's I/O streams. |
| |
| Another important feature of __karma__ allows you to fully decouple the data |
| type from the output format. You can use the same output format with different |
| data types as long as these conform conceptually. The next table gives some |
| related examples. |
| |
| [table Different data types usable with the output format `*(int_ << eol)` |
| [ [Data type] [Description] ] |
| [ [`int i[4]`] [C style arrays] ] |
| [ [`std::vector<int>`] [Standard vector] ] |
| [ [`std::list<int>`] [Standard list] ] |
| [ [`boost::array<long, 20>`] [Boost array] ] |
| ] |
| |
| [endsect] |