| <html> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII"> |
| <title>Unicode Regular Expression Algorithms</title> |
| <link rel="stylesheet" href="../../../../../../../../doc/src/boostbook.css" type="text/css"> |
| <meta name="generator" content="DocBook XSL Stylesheets V1.74.0"> |
| <link rel="home" href="../../../../index.html" title="Boost.Regex"> |
| <link rel="up" href="../icu.html" title="Working With Unicode and ICU String Types"> |
| <link rel="prev" href="unicode_types.html" title="Unicode regular expression types"> |
| <link rel="next" href="unicode_iter.html" title="Unicode Aware Regex Iterators"> |
| </head> |
| <body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"> |
| <div class="section" lang="en"> |
| <div class="titlepage"><div><div><h5 class="title"> |
| <a name="boost_regex.ref.non_std_strings.icu.unicode_algo"></a><a class="link" href="unicode_algo.html" title="Unicode Regular Expression Algorithms"> |
| Unicode Regular Expression Algorithms</a> |
| </h5></div></div></div> |
| <p> |
| The regular expression algorithms <a class="link" href="../../regex_match.html" title="regex_match"><code class="computeroutput"><span class="identifier">regex_match</span></code></a>, <a class="link" href="../../regex_search.html" title="regex_search"><code class="computeroutput"><span class="identifier">regex_search</span></code></a> and <a class="link" href="../../regex_replace.html" title="regex_replace"><code class="computeroutput"><span class="identifier">regex_replace</span></code></a> all expect |
| the character sequence on which they operate to be encoded in the same |
| character encoding as the regular expression object with which they are |
| used. For Unicode regular expressions that behavior is undesirable: while |
| we may want to process the data in UTF-32 "chunks", the actual |
| data is much more likely to be encoded as either UTF-8 or UTF-16. Therefore |
| the header &lt;boost/regex/icu.hpp&gt; provides a series of thin wrappers |
| around these algorithms, called <code class="computeroutput"><span class="identifier">u32regex_match</span></code>, |
| <code class="computeroutput"><span class="identifier">u32regex_search</span></code>, and |
| <code class="computeroutput"><span class="identifier">u32regex_replace</span></code>. These |
| wrappers use iterator adapters internally to make external UTF-8 or UTF-16 |
| data look like a UTF-32 sequence, which can then be passed |
| on to the "real" algorithm. |
| </p> |
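<p>
As a rough illustration of the iterator-adapter idea described above, the
following stand-alone sketch decodes UTF-8 bytes into UTF-32 code points on
the fly, using only the standard library. The names
<code class="computeroutput">utf8_decoding_iterator</code> and
<code class="computeroutput">to_utf32</code> are hypothetical and are not part
of Boost.Regex; the adapters the library actually uses may differ in
interface and in error handling. The sketch also assumes that rejecting
overlong encodings and surrogate code points is out of scope.
</p>

```cpp
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical adapter, written for illustration only (not Boost.Regex code):
// it walks a UTF-8 byte range and yields one UTF-32 code point per step.
class utf8_decoding_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type        = char32_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const char32_t*;
    using reference         = char32_t;

    utf8_decoding_iterator(std::string::const_iterator pos,
                           std::string::const_iterator end)
        : pos_(pos), end_(end) {}

    // Decode the code point whose first byte is at the current position.
    char32_t operator*() const {
        const std::ptrdiff_t len = sequence_length();
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        if (len == 1) return lead;
        char32_t cp = lead & (0x7Fu >> len);  // payload bits of the lead byte
        for (std::ptrdiff_t i = 1; i < len; ++i) {
            const unsigned char trail = static_cast<unsigned char>(pos_[i]);
            if ((trail & 0xC0u) != 0x80u)
                throw std::invalid_argument("bad UTF-8 continuation byte");
            cp = (cp << 6) | (trail & 0x3Fu);
        }
        return cp;
    }

    // Advance by one whole code point, not by one byte.
    utf8_decoding_iterator& operator++() {
        pos_ += sequence_length();
        return *this;
    }
    utf8_decoding_iterator operator++(int) {
        utf8_decoding_iterator old = *this;
        ++*this;
        return old;
    }

    friend bool operator==(const utf8_decoding_iterator& a,
                           const utf8_decoding_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_decoding_iterator& a,
                           const utf8_decoding_iterator& b) {
        return !(a == b);
    }

private:
    // Number of bytes in the sequence starting at pos_, checked against end_.
    std::ptrdiff_t sequence_length() const {
        if (pos_ == end_) throw std::out_of_range("read past end of input");
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        std::ptrdiff_t len = 0;
        if (lead < 0x80u) len = 1;
        else if ((lead & 0xE0u) == 0xC0u) len = 2;
        else if ((lead & 0xF0u) == 0xE0u) len = 3;
        else if ((lead & 0xF8u) == 0xF0u) len = 4;
        else throw std::invalid_argument("bad UTF-8 lead byte");
        if (end_ - pos_ < len) throw std::invalid_argument("truncated UTF-8");
        return len;
    }

    std::string::const_iterator pos_;
    std::string::const_iterator end_;
};

// Materialise the adapted range so the decoded code points can be inspected.
std::u32string to_utf32(const std::string& utf8) {
    return std::u32string(utf8_decoding_iterator(utf8.begin(), utf8.end()),
                          utf8_decoding_iterator(utf8.end(), utf8.end()));
}
```

<p>
An algorithm that is handed a pair of such iterators only ever sees the
<code class="computeroutput">char32_t</code> values produced by dereferencing
them, so it can treat the underlying UTF-8 text as if it were a UTF-32
sequence without the text being converted up front.
</p>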
| <a name="boost_regex.ref.non_std_strings.icu.unicode_algo.u32regex_match"></a><h5> |
| <a name="id1024353"></a> |
| <a class="link" href="unicode_algo.html#boost_regex.ref.non_std_strings.icu.unicode_algo.u32regex_match">u32regex_match</a> |
| </h5> |
| <p> |
| For each <a class="link" href="../../regex_match.html" title="regex_match"><code class="computeroutput"><span class="identifier">regex_match</span></code></a> |
| algorithm defined by <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code>, |
| <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">/</span><span class="identifier">icu</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code> defines an overloaded algorithm that |
| takes the same arguments but is called <code class="computeroutput"><span class="identifier">u32regex_match</span></code>, |
| and that accepts UTF-8, UTF-16 or UTF-32 encoded data, as well as |
| an ICU UnicodeString, as input. |
| </p> |
| <p> |
| Example: match a password, encoded in a UTF-16 UnicodeString: |
| </p> |
| <pre class="programlisting"><span class="comment">// |
| </span><span class="comment">// Find out if *password* meets our password requirements, |
| </span><span class="comment">// as defined by the regular expression *requirements*. |
| </span><span class="comment">// |
| </span><span class="keyword">bool</span> <span class="identifier">is_valid_password</span><span class="special">(</span><span class="keyword">const</span> <span class="identifier">UnicodeString</span><span class="special">&</span> <span class="identifier">password</span><span class="special">,</span> <span class="keyword">const</span> <span class="identifier">UnicodeString</span><span class="special">&</span> <span class="identifier">requirements</span><span class="special">)</span> |
| <span class="special">{</span> |
| <span class="keyword">return</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex_match</span><span class="special">(</span><span class="identifier">password</span><span class="special">,</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">make_u32regex</span><span class="special">(</span><span class="identifier">requirements</span><span class="special">));</span> |
| <span class="special">}</span> |
| </pre> |
| <p> |
| Example: match a UTF-8 encoded filename: |
| </p> |
| <pre class="programlisting"><span class="comment">// |
| </span><span class="comment">// Extract filename part of a path from a UTF-8 encoded std::string and return the result |
| </span><span class="comment">// as another std::string: |
| </span><span class="comment">// |
| </span><span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">get_filename</span><span class="special">(</span><span class="keyword">const</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span><span class="special">&</span> <span class="identifier">path</span><span class="special">)</span> |
| <span class="special">{</span> |
| <span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex</span> <span class="identifier">r</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">make_u32regex</span><span class="special">(</span><span class="string">"(?:\\A|.*\\\\)([^\\\\]+)"</span><span class="special">);</span> |
| <span class="identifier">boost</span><span class="special">::</span><span class="identifier">smatch</span> <span class="identifier">what</span><span class="special">;</span> |
| <span class="keyword">if</span><span class="special">(</span><span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex_match</span><span class="special">(</span><span class="identifier">path</span><span class="special">,</span> <span class="identifier">what</span><span class="special">,</span> <span class="identifier">r</span><span class="special">))</span> |
| <span class="special">{</span> |
| <span class="comment">// extract $1 as a std::string: |
| </span> <span class="keyword">return</span> <span class="identifier">what</span><span class="special">.</span><span class="identifier">str</span><span class="special">(</span><span class="number">1</span><span class="special">);</span> |
| <span class="special">}</span> |
| <span class="keyword">else</span> |
| <span class="special">{</span> |
| <span class="keyword">throw</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">runtime_error</span><span class="special">(</span><span class="string">"Invalid pathname"</span><span class="special">);</span> |
| <span class="special">}</span> |
| <span class="special">}</span> |
| </pre> |
| <a name="boost_regex.ref.non_std_strings.icu.unicode_algo.u32regex_search"></a><h5> |
| <a name="id1024896"></a> |
| <a class="link" href="unicode_algo.html#boost_regex.ref.non_std_strings.icu.unicode_algo.u32regex_search">u32regex_search</a> |
| </h5> |
| <p> |
| For each <a class="link" href="../../regex_search.html" title="regex_search"><code class="computeroutput"><span class="identifier">regex_search</span></code></a> |
| algorithm defined by <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code>, |
| <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">/</span><span class="identifier">icu</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code> defines an overloaded algorithm that |
| takes the same arguments but is called <code class="computeroutput"><span class="identifier">u32regex_search</span></code>, |
| and that accepts UTF-8, UTF-16 or UTF-32 encoded data, as well as |
| an ICU UnicodeString, as input. |
| </p> |
| <p> |
| Example: search for a character sequence in a specific language block: |
| </p> |
| <pre class="programlisting"><span class="identifier">UnicodeString</span> <span class="identifier">extract_greek</span><span class="special">(</span><span class="keyword">const</span> <span class="identifier">UnicodeString</span><span class="special">&</span> <span class="identifier">text</span><span class="special">)</span> |
| <span class="special">{</span> |
| <span class="comment">// Searches through some UTF-16 encoded text for a block encoded in Greek. |
| </span> <span class="comment">// This expression is imperfect, but it is the best we can do for now: searching |
| </span> <span class="comment">// for specific scripts is actually pretty hard to do right. |
| </span> <span class="comment">// |
| </span> <span class="comment">// Here we search for a character sequence that begins with a Greek letter, |
| </span> <span class="comment">// and continues with characters that are either not-letters ( [^[:L*:]] ) |
| </span> <span class="comment">// or are characters in the Greek character block ( [\\x{370}-\\x{3FF}] ). |
| </span> <span class="comment">// |
| </span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex</span> <span class="identifier">r</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">make_u32regex</span><span class="special">(</span> |
| <span class="identifier">L</span><span class="string">"[\\x{370}-\\x{3FF}](?:[^[:L*:]]|[\\x{370}-\\x{3FF}])*"</span><span class="special">);</span> |
| <span class="identifier">boost</span><span class="special">::</span><span class="identifier">u16match</span> <span class="identifier">what</span><span class="special">;</span> |
| <span class="keyword">if</span><span class="special">(</span><span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex_search</span><span class="special">(</span><span class="identifier">text</span><span class="special">,</span> <span class="identifier">what</span><span class="special">,</span> <span class="identifier">r</span><span class="special">))</span> |
| <span class="special">{</span> |
| <span class="comment">// extract $0 as a UnicodeString: |
| </span> <span class="keyword">return</span> <span class="identifier">UnicodeString</span><span class="special">(</span><span class="identifier">what</span><span class="special">[</span><span class="number">0</span><span class="special">].</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">what</span><span class="special">.</span><span class="identifier">length</span><span class="special">(</span><span class="number">0</span><span class="special">));</span> |
| <span class="special">}</span> |
| <span class="keyword">else</span> |
| <span class="special">{</span> |
| <span class="keyword">throw</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">runtime_error</span><span class="special">(</span><span class="string">"No Greek found!"</span><span class="special">);</span> |
| <span class="special">}</span> |
| <span class="special">}</span> |
| </pre> |
| <a name="boost_regex.ref.non_std_strings.icu.unicode_algo.u32regex_replace"></a><h5> |
| <a name="id1025314"></a> |
| <a class="link" href="unicode_algo.html#boost_regex.ref.non_std_strings.icu.unicode_algo.u32regex_replace">u32regex_replace</a> |
| </h5> |
| <p> |
| For each <a class="link" href="../../regex_replace.html" title="regex_replace"><code class="computeroutput"><span class="identifier">regex_replace</span></code></a> algorithm defined |
| by <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code>, <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">/</span><span class="identifier">icu</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code> |
| defines an overloaded algorithm that takes the same arguments but |
| is called <code class="computeroutput"><span class="identifier">u32regex_replace</span></code>, |
| and that accepts UTF-8, UTF-16 or UTF-32 encoded data, as well as |
| an ICU UnicodeString, as input. The input sequence and the format string |
| passed to the algorithm can be encoded independently (for |
| example, one can be in UTF-8 and the other in UTF-16), but the result string |
| or output iterator argument must use the same character encoding as the |
| text being searched. |
| </p> |
| <p> |
| Example: Credit card number reformatting: |
| </p> |
| <pre class="programlisting"><span class="comment">// |
| </span><span class="comment">// Take a credit card number as a string of digits, |
| </span><span class="comment">// and reformat it as a human readable string with "-" |
| </span><span class="comment">// separating each group of four digits; |
| </span><span class="comment">// note that we're mixing a UTF-32 regex, with a UTF-16 |
| </span><span class="comment">// string and a UTF-8 format specifier, and it still all |
| </span><span class="comment">// just works: |
| </span><span class="comment">// |
| </span><span class="keyword">const</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex</span> <span class="identifier">e</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">make_u32regex</span><span class="special">(</span> |
| <span class="string">"\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z"</span><span class="special">);</span> |
| <span class="keyword">const</span> <span class="keyword">char</span><span class="special">*</span> <span class="identifier">human_format</span> <span class="special">=</span> <span class="string">"$1-$2-$3-$4"</span><span class="special">;</span> |
| |
| <span class="identifier">UnicodeString</span> <span class="identifier">human_readable_card_number</span><span class="special">(</span><span class="keyword">const</span> <span class="identifier">UnicodeString</span><span class="special">&</span> <span class="identifier">s</span><span class="special">)</span> |
| <span class="special">{</span> |
| <span class="keyword">return</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex_replace</span><span class="special">(</span><span class="identifier">s</span><span class="special">,</span> <span class="identifier">e</span><span class="special">,</span> <span class="identifier">human_format</span><span class="special">);</span> |
| <span class="special">}</span> |
| </pre> |
| </div> |
| <table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr> |
| <td align="left"></td> |
| <td align="right"><div class="copyright-footer">Copyright © 1998-2010 John Maddock<p> |
| Distributed under the Boost Software License, Version 1.0. (See accompanying |
| file LICENSE_1_0.txt or copy at <a href="http://www.boost.org/LICENSE_1_0.txt" target="_top">http://www.boost.org/LICENSE_1_0.txt</a>) |
| </p> |
| </div></td> |
| </tr></table> |
| </body> |
| </html> |