boost/libs/locale/doc/boundary_analysys.txt - nest-cam/4320010/boost - Git at Google

 // vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen

 //
 //  Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
 //
 //  Distributed under the Boost Software License, Version 1.0. (See
 //  accompanying file LICENSE_1_0.txt or copy at
 //  http://www.boost.org/LICENSE_1_0.txt)
 //

 /*!
 \page boundary_analysys Boundary analysis

 - \ref boundary_analysys_basics
 - \ref boundary_analysys_segments
     - \ref boundary_analysys_segments_basics
     - \ref boundary_analysys_segments_rules
     - \ref boundary_analysys_segments_search
 - \ref boundary_analysys_break
     - \ref boundary_analysys_break_basics
     - \ref boundary_analysys_break_rules
     - \ref boundary_analysys_break_search


 \section boundary_analysys_basics Basics

 Boost.Locale provides a boundary analysis tool, allowing you to split text into characters,
 words, or sentences, and find appropriate places for line breaks.

 \note This task is not a trivial task.
 \par
 A Unicode code point and a character are not equivalent, for example:
 Hebrew word Shalom - "שָלוֹם" that consists of 4 characters and 6 code points (4 base letters and 2 diacritical marks)
 \par
 Words may not be separated by space characters in some languages like in Japanese or Chinese.

 Boost.Locale provides 2 major classes for boundary analysis:

 -   \ref boost::locale::boundary::segment_index - an object that holds an index of segments in the text (like words, characters,
     sentences). It provides an access to \ref boost::locale::boundary::segment "segment" objects via iterators.
 -   \ref boost::locale::boundary::boundary_point_index - an object that holds an index of boundary points in the text.
     It allows to iterate over the \ref boost::locale::boundary::boundary_point "boundary_point" objects.

 Each of the classes above use an iterator type as template parameter.
 Both of these classes accept in their constructor:

 - A flag that defines boundary analysis \ref boost::locale::boundary::boundary_type "boundary_type".
 - The pair of iterators that define the text range that should be analysed
 - A locale parameter (if not given the global one is used)

 For example:
 \code
 namespace ba=boost::locale::boundary;
 std::string text= ... ;
 std::locale loc = ... ;
 ba::segment_index<std::string::const_iterator> map(ba::word,text.begin(),text.end(),loc);
 \endcode

 Each of them provide a members \c begin(), \c end() and \c find() that allow to iterate
 over the selected segments or boundaries in the text or find a location of a segment or
 boundary for given iterator.


 Convenience a typedefs like \ref boost::locale::boundary::ssegment_index "ssegment_index"
 or \ref boost::locale::boundary::wcboundary_point_index "wcboundary_point_index" provided as well,
 where "w", "u16" and "u32" prefixes define a character type \c wchar_t,
 \c char16_t and \c char32_t and "c" and "s" prefixes define whether <tt>std::basic_string<CharType>::const_iterator</tt>
 or <tt>CharType const *</tt> are used.

 \section boundary_analysys_segments Iterating Over Segments
 \section boundary_analysys_segments_basics Basic Iteration

 The text segments analysis is done using \ref boost::locale::boundary::segment_index "segment_index" class.

 It provides a bidirectional iterator that returns \ref boost::locale::boundary::segment "segment" object.
 The segment object represents a pair of iterators that define this segment and a rule according to which it was selected.
 It can be automatically converted to \c std::basic_string object.

 To perform boundary analysis, we first create an index object and then iterate over it:

 For example:

 \code
 using namespace boost::locale::boundary;
 boost::locale::generator gen;
 std::string text="To be or not to be, that is the question."
 // Create mapping of text for token iterator using global locale.
 ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
 // Print all "words" -- chunks of word boundary
 for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
     std::cout <<"\""<< * it << "\", ";
 std::cout << std::endl;
 \endcode

 Would print:

 \verbatim
 "To", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be", ",", " ", "that", " ", "is", " ", "the", " ", "question", ".",
 \endverbatim

 This sentence "生きるか死ぬか、それが問題だ。" (<a href="http://tatoeba.org/eng/sentences/show/868189">from Tatoeba database</a>)
 would be split into following segments in \c ja_JP.UTF-8 (Japanese) locale:

 \verbatim
 "生", "きるか", "死", "ぬか", "、", "それが", "問題", "だ", "。",
 \endverbatim

 The boundary analysis that is done by Boost.Locale
 is much more complicated then just splitting the text according
 to white space characters, even thou it is not perfect.


 \section boundary_analysys_segments_rules Using Rules

 The segments selection can be customized using \ref boost::locale::boundary::segment_index::rule(rule_type) "rule()" and
 \ref boost::locale::boundary::segment_index::full_select(bool) "full_select()" member functions.

 By default segment_index's iterator return each text segment defined by two boundary points regardless
 the way they were selected. Thus in the example above we could see text segments like "." or " "
 that were selected as words.

 Using a \c rule() member function we can specify a binary mask of rules we want to use for selection of
 the boundary points using \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
 and \ref bl_boundary_sentence_rules "sentence" boundary rules.

 For example, by calling

 \code
 map.rule(word_any);
 \endcode

 Before starting the iteration process, specify a selection mask that fetches: numbers, letter, Kana letters and
 ideographic characters ignoring all non-word related characters like white space or punctuation marks.

 So the code:

 \code
 using namespace boost::locale::boundary;
 std::string text="To be or not to be, that is the question."
 // Create mapping of text for token iterator using global locale.
 ssegment_index map(word,text.begin(),text.end());
 // Define a rule
 map.rule(word_any);
 // Print all "words" -- chunks of word boundary
 for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
     std::cout <<"\""<< * it << "\", ";
 std::cout << std::endl;
 \endcode

 Would print:

 \verbatim
 "To", "be", "or", "not", "to", "be", "that", "is", "the", "question",
 \endverbatim

 And the for given text="生きるか死ぬか、それが問題だ。" and rule(\ref boost::locale::boundary::word_ideo "word_ideo"), the example above would print.

 \verbatim
 "生", "死", "問題",
 \endverbatim

 You can access specific rules the segments where selected it using \ref boost::locale::boundary::segment::rule() "segment::rule()" member
 function. Using a bit-mask of rules.

 For example:

 \code
 boost::locale::generator gen;
 using namespace boost::locale::boundary;
 std::string text="生きるか死ぬか、それが問題だ。";
 ssegment_index map(word,text.begin(),text.end(),gen("ja_JP.UTF-8"));
 for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it) {
     std::cout << "Segment " << *it << " contains: ";
     if(it->rule() & word_none)
         std::cout << "white space or punctuation marks ";
     if(it->rule() & word_kana)
         std::cout << "kana characters ";
     if(it->rule() & word_ideo)
         std::cout << "ideographic characters";
     std::cout<< std::endl;
 }
 \endcode

 Would print

 \verbatim
 Segment 生 contains: ideographic characters
 Segment きるか contains: kana characters
 Segment 死 contains: ideographic characters
 Segment ぬか contains: kana characters
 Segment 、 contains: white space or punctuation marks
 Segment それが contains: kana characters
 Segment 問題 contains: ideographic characters
 Segment だ contains: kana characters
 Segment 。 contains: white space or punctuation marks
 \endverbatim

 One important things that should be noted that each segment is defined
 by a pair of boundaries and the rule of its ending point defines
 if it is selected or not.

 In some cases it may be not what we actually look like.

 For example we have a text:

 \verbatim
 Hello! How
 are you?
 \endverbatim

 And we want to fetch all sentences from the text.

 The \ref bl_boundary_sentence_rules "sentence rules" have two options:

 - Split the text on the point where sentence terminator like ".!?" detected: \ref boost::locale::boundary::sentence_term "sentence_term"
 - Split the text on the point where sentence separator like "line feed" detected: \ref boost::locale::boundary::sentence_sep "sentence_sep"

 Naturally to ignore sentence separators we would call \ref boost::locale::boundary::segment_index::rule(rule_type v) "segment_index::rule(rule_type v)"
 with sentence_term parameter and then run the iterator.

 \code
 boost::locale::generator gen;
 using namespace boost::locale::boundary;
 std::string text=   "Hello! How\n"
                     "are you?\n";
 ssegment_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));
 map.rule(sentence_term);
 for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
     std::cout << "Sentence [" << *it << "]" << std::endl;
 \endcode

 However we would get the expected segments:
 \verbatim
 Sentence [Hello! ]
 Sentence [are you?
 ]
 \endverbatim

 The reason is that "How\n" is still considered a sentence but selected by different
 rule.

 This behavior can be changed by setting \ref boost::locale::boundary::segment_index::full_select(bool) "segment_index::full_select(bool)"
 to \c true. It would force iterator to join the current segment with all previous segments that may not fit the required rule.

 So we add this line:

 \code
 map.full_select(true);
 \endcode

 Right after "map.rule(sentence_term);" and get expected output:

 \verbatim
 Sentence [Hello! ]
 Sentence [How
 are you?
 ]
 \endverbatim

 \subsection boundary_analysys_segments_search Locating Segments

 Sometimes it is useful to find a segment that some specific iterator is pointing on.

 For example a user had clicked at specific point, we want to select a word on this
 location.

 \ref boost::locale::boundary::segment_index "segment_index" provides
 \ref boost::locale::boundary::segment_index::find() "find(base_iterator p)"
 member function for this purpose.

 This function returns the iterator to the segmet such that \a p points to.


 For example:

 \code
 text="to be or ";
 ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
 ssegment_index::iterator  p = map.find(text.begin() + 4);
 if(p!=map.end())
     std::cout << *p << std::endl;
 \endcode

 Would print:

 \verbatim
 be
 \endverbatim

 \note

 if the iterator lays inside the segment this segment returned. If the segment does
 not fit the selection rules, then the segment following requested position
 is returned.

 For example: For \ref boost::locale::boundary::word "word" boundary analysis with \ref boost::locale::boundary::word_any "word_any" rule:

 - "t|o be or ", would point to "to" - the iterator in the middle of segment "to".
 - "to |be or ", would point to "be" - the iterator at the beginning of the segment "be"
 - "to| be or ", would point to "be" - the iterator does is not point to segment with required rule so next valid segment is selected "be".
 - "to be or| ", would point to end as not valid segment found.


 \section boundary_analysys_break Iterating Over Boundary Points
 \section boundary_analysys_break_basics Basic Iteration

 The \ref boost::locale::boundary::boundary_point_index "boundary_point_index" is similar  to
 \ref boost::locale::boundary::segment_index "segment_index" in its interface but as a different role.
 Instead of returning text chunks (\ref boost::locale::boundary::segment "segment"s, it returns
 \ref boost::locale::boundary::boundary_point "boundary_point" object that
 represents a position in text - a base iterator used that is used for
 iteration of the source text C++ characters.
 The \ref boost::locale::boundary::boundary_point "boundary_point" object
 also provides a \ref boost::locale::boundary::boundary_point::rule() "rule()" member
 function that defines a rule this boundary was selected according to.

 \note The beginning and the ending of the text are considered boundary points, so even
 an empty text consists of at least one boundary point.

 Lets see an example of selecting first two sentences from a text:

 \code
 using namespace boost::locale::boundary;
 boost::locale::generator gen;

 // our text sample
 std::string const text="First sentence. Second sentence! Third one?";
 // Create an index
 sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

 // Count two boundary points
 sboundary_point_index::iterator p = map.begin(),e=map.end();
 int count = 0;
 while(p!=e && count < 2) {
     ++count;
     ++p;
 }

 if(p!=e) {
     std::cout   << "First two sentences are: "
                 << std::string(text.begin(),p->iterator())
                 << std::endl;
 }
 else {
     std::cout   <<"There are less then two sentences in this "
                 <<"text: " << text << std::endl;
 }\endcode

 Would print:

 \verbatim
 First two sentences are: First sentence. Second sentence!
 \endverbatim

 \section boundary_analysys_break_rules Using Rules

 Similarly to the \ref boost::locale::boundary::segment_index "segment_index" the
 \ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
 a \ref boost::locale::boundary::boundary_point_index::rule(rule_type r) "rule(rule_type mask)"
 member function to filter boundary points that interest us.

 It allows to set \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
 and \ref bl_boundary_sentence_rules "sentence" rules for filtering boundary points.

 Lets change an example above a little:

 \code
 // our text sample
 std::string const text= "First sentence. Second\n"
                         "sentence! Third one?";
 \endcode

 If we run our program as is on the sample above we would get:
 \verbatim
 First two sentences are: First sentence. Second
 \endverbatim

 Which is not something that we really expected. As the "Second\n"
 is considered an independent sentence that was separated by
 a line separator "Line Feed".

 However, we can set set a rule \ref boost::locale::boundary::sentence_term "sentence_term"
 and the iterator would use only boundary points that are created
 by a sentence terminators like ".!?".

 So by adding:
 \code
 map.rule(sentence_term);
 \endcode

 Right after the generation of the index we would get the desired output:

 \verbatim
 First two sentences are: First sentence. Second
 sentence!
 \endverbatim

 You can also use \ref boost::locale::boundary::boundary_point::rule() "boundary_point::rule()" member
 function to learn about the reason this boundary point was created by comparing it with an appropriate
 mask.

 For example:

 \code
 using namespace boost::locale::boundary;
 boost::locale::generator gen;
 // our text sample
 std::string const text= "First sentence. Second\n"
                         "sentence! Third one?";
 sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

 for(sboundary_point_index::iterator p = map.begin(),e=map.end();p!=e;++p) {
     if(p->rule() & sentence_term)
         std::cout << "There is a sentence terminator: ";
     else if(p->rule() & sentence_sep)
         std::cout << "There is a sentence separator: ";
     if(p->rule()!=0) // print if some rule exists
         std::cout   << "[" << std::string(text.begin(),p->iterator())
                     << "|" << std::string(p->iterator(),text.end())
                     << "]\n";
 }
 \endcode

 Would give the following output:
 \verbatim
 There is a sentence terminator: [First sentence. |Second
 sentence! Third one?]
 There is a sentence separator: [First sentence. Second
 |sentence! Third one?]
 There is a sentence terminator: [First sentence. Second
 sentence! |Third one?]
 There is a sentence terminator: [First sentence. Second
 sentence! Third one?|]
 \endverbatim

 \subsection boundary_analysys_break_search Locating Boundary Points

 Sometimes it is useful to find a specific boundary point according to given
 iterator.

 \ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
 a \ref boost::locale::boundary::boundary_point_index::find() "iterator find(base_iterator p)" member
 function.

 It would return an iterator to a boundary point on \a p's location or at the
 location following it if \a p does not point to appropriate position.

 For example, for word boundary analysis:

 - If a base iterator points to "to |be", then the returned boundary point would be "to |be" (same position)
 - If a base iterator points to "t|o be", then the returned boundary point would be "to| be" (next valid position)

 For example if we want to select 6 words around specific boundary point we can use following code:

 \code
 using namespace boost::locale::boundary;
 boost::locale::generator gen;
 // our text sample
 std::string const text= "To be or not to be, that is the question.";

 // Create a mapping
 sboundary_point_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
 // Ignore wite space
 map.rule(word_any);

 // define our arbitraty point
 std::string::const_iterator pos = text.begin() + 12; // "no|t";

 // Get the search range
 sboundary_point_index::iterator
     begin =map.begin(),
     end = map.end(),
     it = map.find(pos); // find a boundary

 // go 3 words backward
 for(int count = 0;count <3 && it!=begin; count ++)
     --it;

 // Save the start
 std::string::const_iterator start = *it;

 // go 6 words forward
 for(int count = 0;count < 6 && it!=end; count ++)
     ++it;

 // make sure we at valid position
 if(it==end)
     --it;

 // print the text
 std::cout << std::string(start,it->iterator()) << std::endl;
 \endcode

 That would print:

 \verbatim
  be or not to be, that
 \endverbatim


 */
	// vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen

	//
	// Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
	//
	// Distributed under the Boost Software License, Version 1.0. (See
	// accompanying file LICENSE_1_0.txt or copy at
	// http://www.boost.org/LICENSE_1_0.txt)
	//

	/*!
	\page boundary_analysys Boundary analysis

	- \ref boundary_analysys_basics
	- \ref boundary_analysys_segments
	- \ref boundary_analysys_segments_basics
	- \ref boundary_analysys_segments_rules
	- \ref boundary_analysys_segments_search
	- \ref boundary_analysys_break
	- \ref boundary_analysys_break_basics
	- \ref boundary_analysys_break_rules
	- \ref boundary_analysys_break_search


	\section boundary_analysys_basics Basics

	Boost.Locale provides a boundary analysis tool, allowing you to split text into characters,
	words, or sentences, and find appropriate places for line breaks.

	\note This task is not a trivial task.
	\par
	A Unicode code point and a character are not equivalent, for example:
	Hebrew word Shalom - "שָלוֹם" that consists of 4 characters and 6 code points (4 base letters and 2 diacritical marks)
	\par
	Words may not be separated by space characters in some languages like in Japanese or Chinese.

	Boost.Locale provides 2 major classes for boundary analysis:

	- \ref boost::locale::boundary::segment_index - an object that holds an index of segments in the text (like words, characters,
	sentences). It provides an access to \ref boost::locale::boundary::segment "segment" objects via iterators.
	- \ref boost::locale::boundary::boundary_point_index - an object that holds an index of boundary points in the text.
	It allows to iterate over the \ref boost::locale::boundary::boundary_point "boundary_point" objects.

	Each of the classes above use an iterator type as template parameter.
	Both of these classes accept in their constructor:

	- A flag that defines boundary analysis \ref boost::locale::boundary::boundary_type "boundary_type".
	- The pair of iterators that define the text range that should be analysed
	- A locale parameter (if not given the global one is used)

	For example:
	\code
	namespace ba=boost::locale::boundary;
	std::string text= ... ;
	std::locale loc = ... ;
	ba::segment_index<std::string::const_iterator> map(ba::word,text.begin(),text.end(),loc);
	\endcode

	Each of them provide a members \c begin(), \c end() and \c find() that allow to iterate
	over the selected segments or boundaries in the text or find a location of a segment or
	boundary for given iterator.


	Convenience a typedefs like \ref boost::locale::boundary::ssegment_index "ssegment_index"
	or \ref boost::locale::boundary::wcboundary_point_index "wcboundary_point_index" provided as well,
	where "w", "u16" and "u32" prefixes define a character type \c wchar_t,
	\c char16_t and \c char32_t and "c" and "s" prefixes define whether <tt>std::basic_string<CharType>::const_iterator</tt>
	or <tt>CharType const *</tt> are used.

	\section boundary_analysys_segments Iterating Over Segments
	\section boundary_analysys_segments_basics Basic Iteration

	The text segments analysis is done using \ref boost::locale::boundary::segment_index "segment_index" class.

	It provides a bidirectional iterator that returns \ref boost::locale::boundary::segment "segment" object.
	The segment object represents a pair of iterators that define this segment and a rule according to which it was selected.
	It can be automatically converted to \c std::basic_string object.

	To perform boundary analysis, we first create an index object and then iterate over it:

	For example:

	\code
	using namespace boost::locale::boundary;
	boost::locale::generator gen;
	std::string text="To be or not to be, that is the question."
	// Create mapping of text for token iterator using global locale.
	ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
	// Print all "words" -- chunks of word boundary
	for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
	std::cout <<"\""<< * it << "\", ";
	std::cout << std::endl;
	\endcode

	Would print:

	\verbatim
	"To", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be", ",", " ", "that", " ", "is", " ", "the", " ", "question", ".",
	\endverbatim

	This sentence "生きるか死ぬか、それが問題だ。" (<a href="http://tatoeba.org/eng/sentences/show/868189">from Tatoeba database</a>)
	would be split into following segments in \c ja_JP.UTF-8 (Japanese) locale:

	\verbatim
	"生", "きるか", "死", "ぬか", "、", "それが", "問題", "だ", "。",
	\endverbatim

	The boundary analysis that is done by Boost.Locale
	is much more complicated then just splitting the text according
	to white space characters, even thou it is not perfect.


	\section boundary_analysys_segments_rules Using Rules

	The segments selection can be customized using \ref boost::locale::boundary::segment_index::rule(rule_type) "rule()" and
	\ref boost::locale::boundary::segment_index::full_select(bool) "full_select()" member functions.

	By default segment_index's iterator return each text segment defined by two boundary points regardless
	the way they were selected. Thus in the example above we could see text segments like "." or " "
	that were selected as words.

	Using a \c rule() member function we can specify a binary mask of rules we want to use for selection of
	the boundary points using \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
	and \ref bl_boundary_sentence_rules "sentence" boundary rules.

	For example, by calling

	\code
	map.rule(word_any);
	\endcode

	Before starting the iteration process, specify a selection mask that fetches: numbers, letter, Kana letters and
	ideographic characters ignoring all non-word related characters like white space or punctuation marks.

	So the code:

	\code
	using namespace boost::locale::boundary;
	std::string text="To be or not to be, that is the question."
	// Create mapping of text for token iterator using global locale.
	ssegment_index map(word,text.begin(),text.end());
	// Define a rule
	map.rule(word_any);
	// Print all "words" -- chunks of word boundary
	for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
	std::cout <<"\""<< * it << "\", ";
	std::cout << std::endl;
	\endcode

	Would print:

	\verbatim
	"To", "be", "or", "not", "to", "be", "that", "is", "the", "question",
	\endverbatim

	And the for given text="生きるか死ぬか、それが問題だ。" and rule(\ref boost::locale::boundary::word_ideo "word_ideo"), the example above would print.

	\verbatim
	"生", "死", "問題",
	\endverbatim

	You can access specific rules the segments where selected it using \ref boost::locale::boundary::segment::rule() "segment::rule()" member
	function. Using a bit-mask of rules.

	For example:

	\code
	boost::locale::generator gen;
	using namespace boost::locale::boundary;
	std::string text="生きるか死ぬか、それが問題だ。";
	ssegment_index map(word,text.begin(),text.end(),gen("ja_JP.UTF-8"));
	for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it) {
	std::cout << "Segment " << *it << " contains: ";
	if(it->rule() & word_none)
	std::cout << "white space or punctuation marks ";
	if(it->rule() & word_kana)
	std::cout << "kana characters ";
	if(it->rule() & word_ideo)
	std::cout << "ideographic characters";
	std::cout<< std::endl;
	}
	\endcode

	Would print

	\verbatim
	Segment 生 contains: ideographic characters
	Segment きるか contains: kana characters
	Segment 死 contains: ideographic characters
	Segment ぬか contains: kana characters
	Segment 、 contains: white space or punctuation marks
	Segment それが contains: kana characters
	Segment 問題 contains: ideographic characters
	Segment だ contains: kana characters
	Segment 。 contains: white space or punctuation marks
	\endverbatim

	One important things that should be noted that each segment is defined
	by a pair of boundaries and the rule of its ending point defines
	if it is selected or not.

	In some cases it may be not what we actually look like.

	For example we have a text:

	\verbatim
	Hello! How
	are you?
	\endverbatim

	And we want to fetch all sentences from the text.

	The \ref bl_boundary_sentence_rules "sentence rules" have two options:

	- Split the text on the point where sentence terminator like ".!?" detected: \ref boost::locale::boundary::sentence_term "sentence_term"
	- Split the text on the point where sentence separator like "line feed" detected: \ref boost::locale::boundary::sentence_sep "sentence_sep"

	Naturally to ignore sentence separators we would call \ref boost::locale::boundary::segment_index::rule(rule_type v) "segment_index::rule(rule_type v)"
	with sentence_term parameter and then run the iterator.

	\code
	boost::locale::generator gen;
	using namespace boost::locale::boundary;
	std::string text= "Hello! How\n"
	"are you?\n";
	ssegment_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));
	map.rule(sentence_term);
	for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
	std::cout << "Sentence [" << *it << "]" << std::endl;
	\endcode

	However we would get the expected segments:
	\verbatim
	Sentence [Hello! ]
	Sentence [are you?
	]
	\endverbatim

	The reason is that "How\n" is still considered a sentence but selected by different
	rule.

	This behavior can be changed by setting \ref boost::locale::boundary::segment_index::full_select(bool) "segment_index::full_select(bool)"
	to \c true. It would force iterator to join the current segment with all previous segments that may not fit the required rule.

	So we add this line:

	\code
	map.full_select(true);
	\endcode

	Right after "map.rule(sentence_term);" and get expected output:

	\verbatim
	Sentence [Hello! ]
	Sentence [How
	are you?
	]
	\endverbatim

	\subsection boundary_analysys_segments_search Locating Segments

	Sometimes it is useful to find a segment that some specific iterator is pointing on.

	For example a user had clicked at specific point, we want to select a word on this
	location.

	\ref boost::locale::boundary::segment_index "segment_index" provides
	\ref boost::locale::boundary::segment_index::find() "find(base_iterator p)"
	member function for this purpose.

	This function returns the iterator to the segmet such that \a p points to.


	For example:

	\code
	text="to be or ";
	ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
	ssegment_index::iterator p = map.find(text.begin() + 4);
	if(p!=map.end())
	std::cout << *p << std::endl;
	\endcode

	Would print:

	\verbatim
	be
	\endverbatim

	\note

	if the iterator lays inside the segment this segment returned. If the segment does
	not fit the selection rules, then the segment following requested position
	is returned.

	For example: For \ref boost::locale::boundary::word "word" boundary analysis with \ref boost::locale::boundary::word_any "word_any" rule:

	- "t\|o be or ", would point to "to" - the iterator in the middle of segment "to".
	- "to \|be or ", would point to "be" - the iterator at the beginning of the segment "be"
	- "to\| be or ", would point to "be" - the iterator does is not point to segment with required rule so next valid segment is selected "be".
	- "to be or\| ", would point to end as not valid segment found.


	\section boundary_analysys_break Iterating Over Boundary Points
	\section boundary_analysys_break_basics Basic Iteration

	The \ref boost::locale::boundary::boundary_point_index "boundary_point_index" is similar to
	\ref boost::locale::boundary::segment_index "segment_index" in its interface but as a different role.
	Instead of returning text chunks (\ref boost::locale::boundary::segment "segment"s, it returns
	\ref boost::locale::boundary::boundary_point "boundary_point" object that
	represents a position in text - a base iterator used that is used for
	iteration of the source text C++ characters.
	The \ref boost::locale::boundary::boundary_point "boundary_point" object
	also provides a \ref boost::locale::boundary::boundary_point::rule() "rule()" member
	function that defines a rule this boundary was selected according to.

	\note The beginning and the ending of the text are considered boundary points, so even
	an empty text consists of at least one boundary point.

	Lets see an example of selecting first two sentences from a text:

	\code
	using namespace boost::locale::boundary;
	boost::locale::generator gen;

	// our text sample
	std::string const text="First sentence. Second sentence! Third one?";
	// Create an index
	sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

	// Count two boundary points
	sboundary_point_index::iterator p = map.begin(),e=map.end();
	int count = 0;
	while(p!=e && count < 2) {
	++count;
	++p;
	}

	if(p!=e) {
	std::cout << "First two sentences are: "
	<< std::string(text.begin(),p->iterator())
	<< std::endl;
	}
	else {
	std::cout <<"There are less then two sentences in this "
	<<"text: " << text << std::endl;
	}\endcode

	Would print:

	\verbatim
	First two sentences are: First sentence. Second sentence!
	\endverbatim

	\section boundary_analysys_break_rules Using Rules

	Similarly to the \ref boost::locale::boundary::segment_index "segment_index" the
	\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
	a \ref boost::locale::boundary::boundary_point_index::rule(rule_type r) "rule(rule_type mask)"
	member function to filter boundary points that interest us.

	It allows to set \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
	and \ref bl_boundary_sentence_rules "sentence" rules for filtering boundary points.

	Lets change an example above a little:

	\code
	// our text sample
	std::string const text= "First sentence. Second\n"
	"sentence! Third one?";
	\endcode

	If we run our program as is on the sample above we would get:
	\verbatim
	First two sentences are: First sentence. Second
	\endverbatim

	Which is not something that we really expected. As the "Second\n"
	is considered an independent sentence that was separated by
	a line separator "Line Feed".

	However, we can set set a rule \ref boost::locale::boundary::sentence_term "sentence_term"
	and the iterator would use only boundary points that are created
	by a sentence terminators like ".!?".

	So by adding:
	\code
	map.rule(sentence_term);
	\endcode

	Right after the generation of the index we would get the desired output:

	\verbatim
	First two sentences are: First sentence. Second
	sentence!
	\endverbatim

	You can also use \ref boost::locale::boundary::boundary_point::rule() "boundary_point::rule()" member
	function to learn about the reason this boundary point was created by comparing it with an appropriate
	mask.

	For example:

	\code
	using namespace boost::locale::boundary;
	boost::locale::generator gen;
	// our text sample
	std::string const text= "First sentence. Second\n"
	"sentence! Third one?";
	sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

	for(sboundary_point_index::iterator p = map.begin(),e=map.end();p!=e;++p) {
	if(p->rule() & sentence_term)
	std::cout << "There is a sentence terminator: ";
	else if(p->rule() & sentence_sep)
	std::cout << "There is a sentence separator: ";
	if(p->rule()!=0) // print if some rule exists
	std::cout << "[" << std::string(text.begin(),p->iterator())
	<< "\|" << std::string(p->iterator(),text.end())
	<< "]\n";
	}
	\endcode

	Would give the following output:
	\verbatim
	There is a sentence terminator: [First sentence. \|Second
	sentence! Third one?]
	There is a sentence separator: [First sentence. Second
	\|sentence! Third one?]
	There is a sentence terminator: [First sentence. Second
	sentence! \|Third one?]
	There is a sentence terminator: [First sentence. Second
	sentence! Third one?\|]
	\endverbatim

	\subsection boundary_analysys_break_search Locating Boundary Points

	Sometimes it is useful to find a specific boundary point according to given
	iterator.

	\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
	a \ref boost::locale::boundary::boundary_point_index::find() "iterator find(base_iterator p)" member
	function.

	It would return an iterator to a boundary point on \a p's location or at the
	location following it if \a p does not point to appropriate position.

	For example, for word boundary analysis:

	- If a base iterator points to "to \|be", then the returned boundary point would be "to \|be" (same position)
	- If a base iterator points to "t\|o be", then the returned boundary point would be "to\| be" (next valid position)

	For example if we want to select 6 words around specific boundary point we can use following code:

	\code
	using namespace boost::locale::boundary;
	boost::locale::generator gen;
	// our text sample
	std::string const text= "To be or not to be, that is the question.";

	// Create a mapping
	sboundary_point_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
	// Ignore wite space
	map.rule(word_any);

	// define our arbitraty point
	std::string::const_iterator pos = text.begin() + 12; // "no\|t";

	// Get the search range
	sboundary_point_index::iterator
	begin =map.begin(),
	end = map.end(),
	it = map.find(pos); // find a boundary

	// go 3 words backward
	for(int count = 0;count <3 && it!=begin; count ++)
	--it;

	// Save the start
	std::string::const_iterator start = *it;

	// go 6 words forward
	for(int count = 0;count < 6 && it!=end; count ++)
	++it;

	// make sure we at valid position
	if(it==end)
	--it;

	// print the text
	std::cout << std::string(start,it->iterator()) << std::endl;
	\endcode

	That would print:

	\verbatim
	be or not to be, that
	\endverbatim


	*/