
 //
 //  Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
 //
 //  Distributed under the Boost Software License, Version 1.0. (See
 //  accompanying file LICENSE_1_0.txt or copy at
 //  http://www.boost.org/LICENSE_1_0.txt)
 //

 // vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen
 /*!
 \page glossary Glossary

 -   \anchor term_bmp <b>Basic Multilingual Plane (BMP)</b> -- a part of
     the <i>Universal Character Set</i> with code points in the range U+0000--U+FFFF.
     The most commonly used UCS characters lie in this plane, including Western, Cyrillic, Hebrew, Thai, Arabic and the most common CJK characters.
     However, many characters lie outside the BMP, and they are required for correct support of East Asian languages.
 -   \b Code \b Point -- a unique number that represents a "character" in the Universal Character Set. Code points lie in the range
     0-0x10FFFF, and are usually displayed as U+XXXX or U+XXXXXX, where X represents a hexadecimal digit.
 -   \anchor term_collation \b Collation -- a sorting order for text, usually alphabetical. It can differ between languages and countries, even for the same
     characters.
 -   \b Encoding - a representation of a character set. Some encodings are capable of representing the full UCS range, like UTF-8, and
     others can only represent a subset of it -- ISO-8859-8 represents only a small subset of about 250 characters of the UCS.
     \n
     Non-Unicode encodings are still very popular. For example, the Latin-1 (or ISO-8859-1) encoding covers most of the characters for
     Western European languages and significantly simplifies the processing of text for applications designed to handle only such languages.
     \n
     For Boost.Locale you should provide an eight-bit (\c std::string) encoding as part of the locale name, like \c en_US.UTF-8 or
     \c he_IL.cp1255 . \c UTF-8 is recommended.
 -   \b Facet - or \c std::locale::facet -- the base class from which every object that describes a specific aspect of a locale is derived.
     Facets can be added to a locale to provide additional culture information.
 -   \b Formatting - representation of various values according to locale preferences. For example, a number 1234.5 (C representation)
     should be displayed as 1,234.5 in the US locale and 1.234,5 in the Russian locale. The date November 1st, 2005 would be represented as
     11/01/2005 in the United States, and 01.11.2005 in Russia. This is an important part of localization.
     \n
     For example: does "You have to bring 134,230 kg of rice on 04/01/2010" mean "134 tons of rice on the first of April" or "134 kg 230 g
     of rice on January 4th"? These are quite different.
 -   \b Gettext - The GNU localization library used for message formatting. Today it is the de facto standard localization library in the
     Open Source world. Boost.Locale message formatting is entirely built on Gettext message catalogs.
 -   \b Locale - a set of parameters that define specific preferences for users in different cultures. It is generally defined by language,
     country, variants, and encoding, and provides information like: collation order, date-time formatting, message formatting, number
     formatting and many others. In C++, locale information is represented by the \c std::locale class.
 -   \b Message \b Formatting -- the representation of user interface strings in the user's language. UI strings are generally
     translated using a dictionary provided by the program's translator.
 -   \b Message \b Domain -- in \a gettext terms, the keyword that represents a message catalog. This is usually an application name. When
     \a gettext and Boost.Locale search for a specific message catalog, they search in the specified path for a file named after the domain.
 -   \anchor term_normalization
     \b Normalization - Unicode normalization is the process of converting strings to a standard form, suitable for text processing and
     comparison. For example, the character "ü" can be represented by a single code point or by a combination of the character "u" and the
     diaeresis "¨". Normalization is an important part of Unicode text processing.
     \n
     Normalization is not locale-dependent, but because it is an important part of Unicode processing, it is included in the Boost.Locale
     library.
 -   \b UCS-2 - a fixed-width Unicode encoding, capable of representing only code points in the <i>Basic Multilingual Plane (BMP)</i>.
     It is a legacy encoding and is not recommended for use.
 -   \b Unicode -- the industry standard that defines the representation and manipulation of text suitable for most languages and countries.
     It should not be confused with the <i>Universal Character Set</i>: Unicode is a much larger standard that also defines algorithms like
     bidirectional display order, Arabic shaping, etc.
 -   <b>Universal Character Set (UCS)</b> - an international standard that defines a set of characters for many scripts and their
     \a code \a points.
 -   \b UTF-8 - a variable-width Unicode transformation format. Each UCS code point is represented as a sequence of between 1 and 4 octets
     that can be easily distinguished. It includes ASCII as a subset. It is the most popular Unicode encoding for web applications, data
     transfer and storage, and is the de facto standard encoding for most POSIX operating systems.
 -   \b UTF-16 - a variable-width Unicode transformation format. Each UCS code point is represented as a sequence of one or two 16-bit words.
     It is a very popular encoding for platforms such as the Win32 API, Java, C#, Python, etc. However, it is frequently confused with the
     <i>UCS-2</i> fixed-width encoding, which can only represent characters in the <i>Basic Multilingual Plane (BMP)</i>.
     \n
     This encoding is used for \c std::wstring under the Win32 platform, where <tt>sizeof(wchar_t)==2</tt>.
 -   \b UTF-32/UCS-4 - a fixed-width Unicode transformation format, where each code point is represented as a single 32-bit word. It has
     the advantage of simple code point representation, but is wasteful in terms of memory usage. It is used for \c std::wstring encoding
     for most POSIX platforms, where <tt>sizeof(wchar_t)==4</tt>.
 -   \anchor term_case_folding <b>Case Folding</b> - the process of converting text to a case-independent representation.
     For example, case folding the word "Grüßen" gives "grüssen", where the letter "ß" is represented in a case-independent way as "ss".
 -   \anchor term_title_case <b>Title Case</b> -
     a text conversion in which each word is capitalized. For example, "hello world" is converted
     to "Hello World".
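
The UTF-8 and UTF-16 entries above describe variable-width encodings of a single code point. The following is a minimal sketch of that mapping in standard C++ only; it is not part of the Boost.Locale API, and the names \c is_scalar_value, \c to_utf8 and \c to_utf16 are hypothetical helpers introduced for illustration.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: true for values that can be encoded, i.e. inside
// 0..0x10FFFF and not in the UTF-16 surrogate range U+D800..U+DFFF.
bool is_scalar_value(char32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical helper: encode one code point as 1 to 4 UTF-8 octets.
std::string to_utf8(char32_t cp)
{
    if (!is_scalar_value(cp))
        throw std::invalid_argument("not a valid code point");
    std::string out;
    if (cp < 0x80) {                       // ASCII subset: one octet
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // two octets
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // rest of the BMP: three octets
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // outside the BMP: four octets
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: encode one code point as one or two 16-bit words.
std::u16string to_utf16(char32_t cp)
{
    if (!is_scalar_value(cp))
        throw std::invalid_argument("not a valid code point");
    std::u16string out;
    if (cp < 0x10000) {                    // BMP: a single 16-bit word
        out += static_cast<char16_t>(cp);
    } else {                               // outside the BMP: surrogate pair
        const char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    }
    return out;
}
```

For instance, \c to_utf8(0x20AC) yields the three octets E2 82 AC, while \c to_utf16(0x1F600) yields two 16-bit words because U+1F600 lies outside the BMP. This is the case that code written for UCS-2 cannot represent.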

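The Facet and Formatting entries above can be illustrated with a small sketch that uses only the standard library. It assumes a hypothetical facet class \c simple_numpunct and a hypothetical helper \c format_number; neither is provided by Boost.Locale, which installs its own, far more complete facets when a locale is generated.

```cpp
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: digits are grouped by three, and the grouping and
// decimal separators are chosen by the caller.
class simple_numpunct : public std::numpunct<char> {
public:
    simple_numpunct(char thousands, char decimal)
        : thousands_(thousands), decimal_(decimal) {}

protected:
    char do_thousands_sep() const override { return thousands_; }
    char do_decimal_point() const override { return decimal_; }
    std::string do_grouping() const override { return "\3"; }

private:
    char thousands_;
    char decimal_;
};

// Hypothetical helper: format a value with one fractional digit using
// the given separators.
std::string format_number(double value, char thousands, char decimal)
{
    std::ostringstream out;
    // The new locale is a copy of the classic "C" locale with the numpunct
    // facet replaced; the locale takes ownership of the facet object.
    out.imbue(std::locale(std::locale::classic(),
                          new simple_numpunct(thousands, decimal)));
    out << std::fixed << std::setprecision(1) << value;
    return out.str();
}
```

With this sketch, \c format_number(1234.5, ',', '.') produces "1,234.5" and \c format_number(1234.5, '.', ',') produces "1.234,5", the two representations of the same value used in the Formatting entry.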
 */