| [/============================================================================== |
| Copyright (C) 2001-2010 Joel de Guzman |
| Copyright (C) 2001-2010 Hartmut Kaiser |
| |
| Distributed under the Boost Software License, Version 1.0. (See accompanying |
| file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) |
| ===============================================================================/] |
| |
| [section:lexer_token_values About Tokens and Token Values] |
| |
| As already discussed, lexical scanning is the process of analyzing the stream |
| of input characters and separating it into strings called tokens, most of the |
| time separated by whitespace. The different token types recognized by a lexical |
| analyzer often get assigned unique integer token identifiers (token ids). These |
| token ids are normally used by the parser to identify the current token without |
| having to look at the matched string again. The __lex__ library is no |
| different in this respect, as it uses token ids as the main means of |
| identifying the different token types defined for a particular lexical |
| analyzer. However, it differs from commonly used lexical analyzers in |
| that it returns (references to) instances of a (user defined) token class |
| to the user. The only requirement for this token class is that it must carry at |
| least the token id of the token it represents. For more information about the |
| interface a user defined token type has to expose, please see the |
| __sec_ref_lex_token__ reference. The library provides a default |
| token type based on the __lexertl__ library which should be sufficient in most |
| cases: the __class_lexertl_token__ type. This section focuses on the |
| description of general features a token class may implement and how this |
| integrates with the other parts of the __lex__ library. |
| |
| [heading The Anatomy of a Token] |
| |
| It is very important to understand the difference between a token definition |
| (represented by the __class_token_def__ template) and a token itself (for |
| instance represented by the __class_lexertl_token__ template). |
| |
| The token definition is used to describe the main features of a particular |
| token type, especially: |
| |
| * to simplify the definition of a token type using a regular expression pattern |
| applied while matching this token type, |
| * to associate a token type with a particular lexer state, |
| * to optionally assign a token id to a token type, |
| * to optionally associate some code to execute whenever an instance of this |
| token type has been matched, |
| * and to optionally specify the attribute type of the token value. |
| |
| The token itself is a data structure returned by the lexer iterators. |
| Dereferencing a lexer iterator returns a reference to the last matched token |
| instance. It encapsulates the part of the underlying input sequence matched by |
| the regular expression used during the definition of this token type. |
| Incrementing the lexer iterator invokes the lexical analyzer to |
| match the next token by advancing the underlying input stream. The token data |
| structure contains at least the token id of the matched token type, |
| which identifies the matched character sequence. Optionally, the token |
| instance may contain a token value and/or the lexer state this token instance |
| was matched in. The following [link spirit.lex.tokenstructure figure] shows the |
| schematic structure of a token. |
| |
| [fig tokenstructure.png..The structure of a token..spirit.lex.tokenstructure] |
| |
| The token value and the lexer state the token has been recognized in may be |
| omitted for optimization reasons, thus avoiding the need for the token to carry |
| more data than actually required. This configuration can be achieved by supplying |
| appropriate template parameters for the |
| __class_lexertl_token__ template while defining the token type. |
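| |
| As a rough illustration of why omitting parts of the token pays off, the |
| following self-contained sketch models a token whose value and state are |
| selected by template parameters. It is not the __lex__ implementation: |
| `model_token`, `value_part`, and `state_part` are hypothetical names, and plain |
| `bool` parameters stand in for `omit` and `mpl::true_`/`mpl::false_`. |
| |
```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model only; names and layout are assumptions, not Spirit.Lex code.

// Storage for the token value: here simply the matched iterator range.
template <typename Iterator, bool HasValue>
struct value_part
{
    Iterator first{};
    Iterator last{};
};

// Value omitted: nothing is stored.
template <typename Iterator>
struct value_part<Iterator, false> {};

// Storage for the lexer state the token was matched in.
template <bool HasState>
struct state_part
{
    std::size_t state = 0;
};

// State omitted: nothing is stored.
template <>
struct state_part<false> {};

// The token id is always present; value and state are optional parts.
template <typename Iterator = char const*, bool HasValue = true, bool HasState = true>
struct model_token : value_part<Iterator, HasValue>, state_part<HasState>
{
    std::size_t id = 0;
};

using full_token = model_token<>;                             // id + value + state
using minimal_token = model_token<char const*, false, false>; // id only
```
| |
| In this model `minimal_token` declares only the token id as a data member and |
| is therefore smaller than `full_token`. The exact sizes depend on the platform. |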
| |
| The lexer iterator returns the same token type for each of the different |
| matched token definitions. To accommodate the possibly different token |
| /value/ types exposed by the various token types (token definitions), the |
| general type of the token value is a __boost_variant__. At a minimum (for the |
| default configuration) this token value variant will be configured to always |
| hold a __boost_iterator_range__ containing the pair of iterators pointing to |
| the matched input sequence for this token instance. |
| |
| [note If the lexical analyzer is used in conjunction with a __qi__ parser, the |
| stored __boost_iterator_range__ token value will be converted to the |
| requested token value type (parser attribute) exactly once. This happens at the |
| time of the first access to the token value requiring the |
| corresponding type conversion. The converted token value will be stored |
| in the __boost_variant__ replacing the initially stored iterator range. |
| This avoids having to convert the input sequence to the token value more |
| than once, thus optimizing the integration of the lexer with __qi__, even |
| during parser backtracking. |
| ] |
| |
| Here is the template prototype of the __class_lexertl_token__ template: |
| |
|     template < |
|         typename Iterator = char const*, |
|         typename AttributeTypes = mpl::vector0<>, |
|         typename HasState = mpl::true_ |
|     > |
|     struct lexertl_token; |
| |
| [variablelist where: |
| [[Iterator] [This is the type of the iterator used to access the |
| underlying input stream. It defaults to a plain |
| `char const*`.]] |
| [[AttributeTypes] [This is either an MPL sequence containing all |
| attribute types used for the token definitions or the |
| type `omit`. If the MPL sequence is empty (which is |
| the default), all token instances will store a |
| __boost_iterator_range__`<Iterator>` pointing to the start |
| and the end of the matched section in the input stream. |
| If the type is `omit`, the generated tokens will |
| contain no token value (attribute) at all.]] |
| [[HasState] [This is either `mpl::true_` or `mpl::false_`, allowing |
| control as to whether the generated token instances will |
| contain the lexer state they were generated in. The |
| default is `mpl::true_`, so all token instances will |
| contain the lexer state.]] |
| ] |
| |
| During construction, a token instance normally holds the |
| __boost_iterator_range__ as its token value, unless it has been defined |
| using the `omit` token value type. This iterator range is then |
| converted in place to the requested token value type (attribute) when it is |
| requested for the first time. |
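| |
| The in-place conversion can be pictured with a small standalone model. The |
| sketch below is an assumption-laden illustration, not library code: it uses |
| `std::variant` and `std::pair` where the library uses __boost_variant__ and |
| __boost_iterator_range__, and `token_value`, `as_uint`, and `conversions` are |
| hypothetical names. |
| |
```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

// Hypothetical model of a token value; std:: types replace the Boost types.
using range = std::pair<char const*, char const*>;
using token_value = std::variant<range, unsigned int, std::string>;

int conversions = 0;   // counts how often the matched text is actually converted

// Return the value as unsigned int. On first access the stored range is
// converted and the result replaces the range inside the variant, so later
// calls (for instance after parser backtracking) do no conversion work.
// Throws std::bad_variant_access if the variant holds a std::string.
unsigned int as_uint(token_value& v)
{
    if (range const* r = std::get_if<range>(&v)) {
        unsigned int n = 0;
        for (char const* p = r->first; p != r->second; ++p)
            n = n * 10 + static_cast<unsigned int>(*p - '0');  // assumes digits only
        ++conversions;
        v.emplace<unsigned int>(n);   // the iterator range is gone from here on
    }
    return std::get<unsigned int>(v);
}
```
| |
| Calling `as_uint` twice on a value constructed from the range of `"1234"` |
| yields `1234` both times, while `conversions` is incremented only once. |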
| |
| |
| [heading The Physiognomy of a Token Definition] |
| |
| The token definitions (represented by the __class_token_def__ template) are |
| normally used as part of the definition of the lexical analyzer. At the same |
| time a token definition instance may be used as a parser component in __qi__. |
| |
| The template prototype of this class is shown here: |
| |
|     template< |
|         typename Attribute = unused_type, |
|         typename Char = char |
|     > |
|     class token_def; |
| |
| [variablelist where: |
| [[Attribute] [This is the type of the token value (attribute) |
| supported by token instances representing this token |
| type. This attribute type is exposed to the __qi__ |
| library, whenever this token definition is used as a |
| parser component. The default attribute type is |
| `unused_type`, which means the token instance holds a |
| __boost_iterator_range__ pointing to the start |
| and the end of the matched section in the input stream. |
| If the attribute is `omit` the token instance will |
| expose no token value at all. Any other type will be |
| used directly as the token value type.]] |
| [[Char] [This is the value type of the iterator for the |
| underlying input sequence. It defaults to `char`.]] |
| ] |
| |
| The semantics of the template parameters for the token type and the token |
| definition type are very similar and interdependent. As a rule of thumb you can |
| think of the token definition type as the means of specifying everything |
| related to a single specific token type (such as `identifier` or `integer`). |
| On the other hand, the token type is used to define the general properties of all |
| token instances generated by the __lex__ library. |
| |
| [important If you don't list any token value types in the token type |
| declaration (resulting in the usage of the default __boost_iterator_range__ |
| token value type), everything will compile and work just fine, just a bit |
| less efficiently. This is because the token value will be converted |
| from the matched input sequence every time it is requested. |
| |
| But as soon as you specify at least one token value type while |
| defining the token type you'll have to list all value types used for |
| __class_token_def__ declarations in the token definition class, |
| otherwise compilation errors will occur. |
| ] |
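| |
| The rule can be stated as a compile-time membership test. The following sketch |
| only models that rule under stated assumptions and does not show how __lex__ |
| enforces it: `type_list` stands in for the MPL sequence passed to the token |
| type, and `contains` and `value_type_allowed` are hypothetical helpers. |
| |
```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for the MPL sequence given to the token type.
template <typename... Ts>
struct type_list {};

// True if T is one of the types in the list.
template <typename T, typename List>
struct contains;

template <typename T, typename... Ts>
struct contains<T, type_list<Ts...> >
  : std::disjunction<std::is_same<T, Ts>...> {};

// Models the rule from the text: an empty list (the default configuration)
// accepts any token definition attribute; a non-empty list must name every
// attribute type used by the token definitions.
template <typename T, typename List>
struct value_type_allowed
  : std::bool_constant<
        std::is_same<List, type_list<> >::value || contains<T, List>::value
    > {};

// Value types listed for the token type in this model.
using listed_types = type_list<unsigned int, std::string>;

static_assert(value_type_allowed<std::string, listed_types>::value,
              "std::string is listed, so a token definition may use it");
```
| |
| In this model a token definition exposing `double` would be rejected for |
| `listed_types` but accepted for the empty list. |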
| |
| |
| [heading Examples of using __class_lexertl_token__] |
| |
| Let's start with some examples. We refer to one of the __lex__ examples (for |
| the full source code of this example please see |
| [@../../example/lex/example4.cpp example4.cpp]). |
| |
| [import ../example/lex/example4.cpp] |
| |
| The first code snippet shows an excerpt of the token definition class: the |
| definition of a couple of token types. Some of the token types do not expose a |
| special token value (`if_`, `else_`, and `while_`). Their token value will |
| always hold the iterator range of the matched input sequence. The token |
| definitions for the `identifier` and the integer `constant` are specialized |
| to expose an explicit token value type each: `std::string` and `unsigned int`. |
| |
| [example4_token_def] |
| |
| As the parsers generated by __qi__ are fully attributed, any __qi__ parser |
| component needs to expose a certain type as its parser attribute. Naturally, |
| the __class_token_def__ exposes the token value type as its parser attribute, |
| enabling a smooth integration with __qi__. |
| |
| The next code snippet demonstrates how the required token value types are |
| specified while defining the token type to use. All of the token value types |
| used for at least one of the token definitions have to be repeated for the |
| token type as well. |
| |
| [example4_token] |
| |
| To prevent the token from having a token value at all, the special tag `omit` |
| can be used: `token_def<omit>` and `lexertl_token<base_iterator_type, omit>`. |
| |
| |
| |
| |
| |
| |
| [endsect] |