| [/ |
| Copyright 2006-2007 John Maddock. |
| Distributed under the Boost Software License, Version 1.0. |
| (See accompanying file LICENSE_1_0.txt or copy at |
| http://www.boost.org/LICENSE_1_0.txt). |
| ] |
| |
| [section Introduction and Overview] |
| |
| Regular expressions are a form of pattern-matching that are often used in |
| text processing; many users will be familiar with the Unix utilities grep, sed |
| and awk, and the programming language Perl, each of which make extensive use |
| of regular expressions. Traditionally C++ users have been limited to the |
| POSIX C API's for manipulating regular expressions, and while Boost.Regex does |
| provide these API's, they do not represent the best way to use the library. |
| For example Boost.Regex can cope with wide character strings, or search and |
| replace operations (in a manner analogous to either sed or Perl), something |
| that traditional C libraries can not do. |
| |
| The class [basic_regex] is the key class in this library; it represents a |
| "machine readable" regular expression, and is very closely modeled on |
| `std::basic_string`, think of it as a string plus the actual state-machine |
| required by the regular expression algorithms. Like `std::basic_string` there |
| are two typedefs that are almost always the means by which this class is referenced: |
| |
| namespace boost{ |
| |
| template <class charT, |
| class traits = regex_traits<charT> > |
| class basic_regex; |
| |
| typedef basic_regex<char> regex; |
| typedef basic_regex<wchar_t> wregex; |
| |
| } |
| |
| To see how this library can be used, imagine that we are writing a credit |
| card processing application. Credit card numbers generally come as a string |
| of 16-digits, separated into groups of 4-digits, and separated by either a |
| space or a hyphen. Before storing a credit card number in a database |
| (not necessarily something your customers will appreciate!), we may want to |
| verify that the number is in the correct format. To match any digit we could |
| use the regular expression [0-9], however ranges of characters like this are |
| actually locale dependent. Instead we should use the POSIX standard |
| form \[\[:digit:\]\], or the Boost.Regex and Perl shorthand for this \\d (note |
| that many older libraries tended to be hard-coded to the C-locale, |
| consequently this was not an issue for them). That leaves us with the |
| following regular expression to validate credit card number formats: |
| |
| [pre (\d{4}[- ]){3}\d{4}] |
| |
| Here the parenthesis act to group (and mark for future reference) |
| sub-expressions, and the {4} means "repeat exactly 4 times". This is an |
| example of the extended regular expression syntax used by Perl, awk and egrep. |
| Boost.Regex also supports the older "basic" syntax used by sed and grep, |
| but this is generally less useful, unless you already have some basic regular |
| expressions that you need to reuse. |
| |
| Now let's take that expression and place it in some C++ code to validate the |
| format of a credit card number: |
| |
| bool validate_card_format(const std::string& s) |
| { |
| static const boost::regex e("(\\d{4}[- ]){3}\\d{4}"); |
| return regex_match(s, e); |
| } |
| |
| Note how we had to add some extra escapes to the expression: remember that |
| the escape is seen once by the C++ compiler, before it gets to be seen by |
| the regular expression engine, consequently escapes in regular expressions |
| have to be doubled up when embedding them in C/C++ code. Also note that |
| all the examples assume that your compiler supports argument-dependent-lookup |
| lookup, if yours doesn't (for example VC6), then you will have to add some |
| `boost::` prefixes to some of the function calls in the examples. |
| |
| Those of you who are familiar with credit card processing, will have realized |
| that while the format used above is suitable for human readable card numbers, |
| it does not represent the format required by online credit card systems; these |
| require the number as a string of 16 (or possibly 15) digits, without any |
| intervening spaces. What we need is a means to convert easily between the two |
| formats, and this is where search and replace comes in. Those who are familiar |
| with the utilities sed and Perl will already be ahead here; we need two |
| strings - one a regular expression - the other a "format string" that provides |
| a description of the text to replace the match with. In Boost.Regex this |
| search and replace operation is performed with the algorithm [regex_replace], |
| for our credit card example we can write two algorithms like this to |
| provide the format conversions: |
| |
| // match any format with the regular expression: |
| const boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z"); |
| const std::string machine_format("\\1\\2\\3\\4"); |
| const std::string human_format("\\1-\\2-\\3-\\4"); |
| |
| std::string machine_readable_card_number(const std::string s) |
| { |
| return regex_replace(s, e, machine_format, boost::match_default | boost::format_sed); |
| } |
| |
| std::string human_readable_card_number(const std::string s) |
| { |
| return regex_replace(s, e, human_format, boost::match_default | boost::format_sed); |
| } |
| |
| Here we've used marked sub-expressions in the regular expression to split out |
| the four parts of the card number as separate fields, the format string then |
| uses the sed-like syntax to replace the matched text with the reformatted version. |
| |
| In the examples above, we haven't directly manipulated the results of |
| a regular expression match, however in general the result of a match contains |
| a number of sub-expression matches in addition to the overall match. When the |
| library needs to report a regular expression match it does so using an instance |
| of the class [match_results], as before there are typedefs of this class for |
| the most common cases: |
| |
| namespace boost{ |
| |
| typedef match_results<const char*> cmatch; |
| typedef match_results<const wchar_t*> wcmatch; |
| typedef match_results<std::string::const_iterator> smatch; |
| typedef match_results<std::wstring::const_iterator> wsmatch; |
| |
| } |
| |
| The algorithms [regex_search] and [regex_match] make use of [match_results] |
| to report what matched; the difference between these algorithms is that |
| [regex_match] will only find matches that consume /all/ of the input text, |
| where as [regex_search] will search for a match anywhere within the text being matched. |
| |
| Note that these algorithms are not restricted to searching regular C-strings, |
| any bidirectional iterator type can be searched, allowing for the |
| possibility of seamlessly searching almost any kind of data. |
| |
| For search and replace operations, in addition to the algorithm [regex_replace] |
| that we have already seen, the [match_results] class has a `format` member that |
| takes the result of a match and a format string, and produces a new string |
| by merging the two. |
| |
| For iterating through all occurences of an expression within a text, |
| there are two iterator types: [regex_iterator] will enumerate over the |
| [match_results] objects found, while [regex_token_iterator] will enumerate |
| a series of strings (similar to perl style split operations). |
| |
| For those that dislike templates, there is a high level wrapper class |
| [RegEx] that is an encapsulation of the lower level template code - it |
| provides a simplified interface for those that don't need the full |
| power of the library, and supports only narrow characters, and the |
| "extended" regular expression syntax. This class is now deprecated as |
| it does not form part of the regular expressions C++ standard library proposal. |
| |
| The POSIX API functions: [regcomp], [regexec], [regfree] and [regerr], |
| are available in both narrow character and Unicode versions, and are |
| provided for those who need compatibility with these API's. |
| |
| Finally, note that the library now has |
| [link boost_regex.background_information.locale run-time localization support], |
| and recognizes the full POSIX regular expression syntax - including |
| advanced features like multi-character collating elements and equivalence |
| classes - as well as providing compatibility with other regular expression |
| libraries including GNU and BSD4 regex packages, PCRE and Perl 5. |
| |
| [endsect] |
| |
| |