arm-2011.03/share/doc/arm-arm-none-linux-gnueabi/html/libc/Extended-Char-Intro.html - nest-cam/v350/arm-none-linux-gnueabi-i686-pc-linux-gnu - Git at Google

 <html lang="en">
 <head>
 <title>Extended Char Intro - The GNU C Library</title>
 <meta http-equiv="Content-Type" content="text/html">
 <meta name="description" content="The GNU C Library">
 <meta name="generator" content="makeinfo 4.13">
 <link title="Top" rel="start" href="index.html#Top">
 <link rel="up" href="Character-Set-Handling.html#Character-Set-Handling" title="Character Set Handling">
 <link rel="next" href="Charset-Function-Overview.html#Charset-Function-Overview" title="Charset Function Overview">
 <link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage">
 <!--
 This file documents the GNU C library.

 This is Edition 0.12, last updated 2007-10-27,
 of `The GNU C Library Reference Manual', for version
 2.8 (Sourcery G++ Lite 2011.03-41).

 Copyright (C) 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2001, 2002,
 2003, 2007, 2008, 2010 Free Software Foundation, Inc.

 Permission is granted to copy, distribute and/or modify this document
 under the terms of the GNU Free Documentation License, Version 1.3 or
 any later version published by the Free Software Foundation; with the
 Invariant Sections being ``Free Software Needs Free Documentation''
 and ``GNU Lesser General Public License'', the Front-Cover texts being
 ``A GNU Manual'', and with the Back-Cover Texts as in (a) below.  A
 copy of the license is included in the section entitled "GNU Free
 Documentation License".

 (a) The FSF's Back-Cover Text is: ``You have the freedom to
 copy and modify this GNU manual.  Buying copies from the FSF
 supports it in developing GNU and promoting software freedom.''-->
 <meta http-equiv="Content-Style-Type" content="text/css">
 <style type="text/css"><!--
   pre.display { font-family:inherit }
   pre.format  { font-family:inherit }
   pre.smalldisplay { font-family:inherit; font-size:smaller }
   pre.smallformat  { font-family:inherit; font-size:smaller }
   pre.smallexample { font-size:smaller }
   pre.smalllisp    { font-size:smaller }
   span.sc    { font-variant:small-caps }
   span.roman { font-family:serif; font-weight:normal; }
   span.sansserif { font-family:sans-serif; font-weight:normal; }
 --></style>
 <link rel="stylesheet" type="text/css" href="../cs.css">
 </head>
 <body>
 <div class="node">
 <a name="Extended-Char-Intro"></a>
 <p>
 Next:&nbsp;<a rel="next" accesskey="n" href="Charset-Function-Overview.html#Charset-Function-Overview">Charset Function Overview</a>,
 Up:&nbsp;<a rel="up" accesskey="u" href="Character-Set-Handling.html#Character-Set-Handling">Character Set Handling</a>
 <hr>
 </div>

 <h3 class="section">6.1 Introduction to Extended Characters</h3>

 <p>A variety of solutions is available to overcome the differences between
 character sets with a 1:1 relation between bytes and characters and
 character sets with ratios of 2:1 or 4:1.  The remainder of this
 section gives a few examples to help understand the design decisions
 made while developing the functionality of the C&nbsp;library<!-- /@w -->.

    <p><a name="index-internal-representation-610"></a>A distinction we have to make right away is between internal and
 external representation.  <dfn>Internal representation</dfn> means the
 representation used by a program while keeping the text in memory.
 External representations are used when text is stored or transmitted
 through some communication channel.  Examples of external
 representations include files waiting in a directory to be
 read and parsed.

    <p>Traditionally there has been no difference between the two representations.
 It was equally comfortable and useful to use the same single-byte
 representation internally and externally.  This comfort level decreases
 with more and larger character sets.

    <p>One of the problems to overcome with the internal representation is
 handling text that is externally encoded using different character
 sets.  Assume a program that reads two texts and compares them using
 some metric.  The comparison can be usefully done only if the texts are
 internally kept in a common format.

    <p><a name="index-wide-character-611"></a>For such a common format (= character set) eight bits are certainly
 no longer enough.  So the smallest entity will have to grow: <dfn>wide
 characters</dfn> will now be used.  Instead of one byte per character, two or
 four will be used instead.  (Three are not good to address in memory and
 more than four bytes seem not to be necessary).

    <p><a name="index-Unicode-612"></a><a name="index-ISO-10646-613"></a>As shown in some other part of this manual,
 <!-- !!! Ahem, wide char string functions are not yet covered - drepper -->
 a completely new family has been created of functions that can handle wide
 character texts in memory.  The most commonly used character sets for such
 internal wide character representations are Unicode and ISO&nbsp;10646<!-- /@w -->
 (also known as UCS for Universal Character Set).  Unicode was originally
 planned as a 16-bit character set; whereas, ISO&nbsp;10646<!-- /@w --> was designed to
 be a 31-bit large code space.  The two standards are practically identical.
 They have the same character repertoire and code table, but Unicode specifies
 added semantics.  At the moment, only characters in the first <code>0x10000</code>
 code positions (the so-called Basic Multilingual Plane, BMP) have been
 assigned, but the assignment of more specialized characters outside this
 16-bit space is already in progress.  A number of encodings have been
 defined for Unicode and ISO&nbsp;10646<!-- /@w --> characters:
 <a name="index-UCS_002d2-614"></a><a name="index-UCS_002d4-615"></a><a name="index-UTF_002d8-616"></a><a name="index-UTF_002d16-617"></a>UCS-2 is a 16-bit word that can only represent characters
 from the BMP, UCS-4 is a 32-bit word than can represent any Unicode
 and ISO&nbsp;10646<!-- /@w --> character, UTF-8 is an ASCII compatible encoding where
 ASCII characters are represented by ASCII bytes and non-ASCII characters
 by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
 of UCS-2 in which pairs of certain UCS-2 words can be used to encode
 non-BMP characters up to <code>0x10ffff</code>.

    <p>To represent wide characters the <code>char</code> type is not suitable.  For
 this reason the ISO&nbsp;C<!-- /@w --> standard introduces a new type that is
 designed to keep one character of a wide character string.  To maintain
 the similarity there is also a type corresponding to <code>int</code> for
 those functions that take a single wide character.

 <!-- stddef.h -->
 <!-- ISO -->
 <div class="defun">
 &mdash; Data type: <b>wchar_t</b><var><a name="index-wchar_005ft-618"></a></var><br>
 <blockquote><p>This data type is used as the base type for wide character strings.
 In other words, arrays of objects of this type are the equivalent of
 <code>char[]</code> for multibyte character strings.  The type is defined in
 <samp><span class="file">stddef.h</span></samp>.

         <p>The ISO&nbsp;C90<!-- /@w --> standard, where <code>wchar_t</code> was introduced, does not
 say anything specific about the representation.  It only requires that
 this type is capable of storing all elements of the basic character set.
 Therefore it would be legitimate to define <code>wchar_t</code> as <code>char</code>,
 which might make sense for embedded systems.

         <p>But for GNU systems <code>wchar_t</code> is always 32 bits wide and, therefore,
 capable of representing all UCS-4 values and, therefore, covering all of
 ISO&nbsp;10646<!-- /@w -->.  Some Unix systems define <code>wchar_t</code> as a 16-bit type
 and thereby follow Unicode very strictly.  This definition is perfectly
 fine with the standard, but it also means that to represent all
 characters from Unicode and ISO&nbsp;10646<!-- /@w --> one has to use UTF-16 surrogate
 characters, which is in fact a multi-wide-character encoding.  But
 resorting to multi-wide-character encoding contradicts the purpose of the
 <code>wchar_t</code> type.
 </p></blockquote></div>

 <!-- wchar.h -->
 <!-- ISO -->
 <div class="defun">
 &mdash; Data type: <b>wint_t</b><var><a name="index-wint_005ft-619"></a></var><br>
 <blockquote><p><code>wint_t</code> is a data type used for parameters and variables that
 contain a single wide character.  As the name suggests this type is the
 equivalent of <code>int</code> when using the normal <code>char</code> strings.  The
 types <code>wchar_t</code> and <code>wint_t</code> often have the same
 representation if their size is 32 bits wide but if <code>wchar_t</code> is
 defined as <code>char</code> the type <code>wint_t</code> must be defined as
 <code>int</code> due to the parameter promotion.

         <p><a name="index-wchar_002eh-620"></a>This type is defined in <samp><span class="file">wchar.h</span></samp> and was introduced in
 Amendment&nbsp;1<!-- /@w --> to ISO&nbsp;C90<!-- /@w -->.
 </p></blockquote></div>

    <p>As there are for the <code>char</code> data type macros are available for
 specifying the minimum and maximum value representable in an object of
 type <code>wchar_t</code>.

 <!-- wchar.h -->
 <!-- ISO -->
 <div class="defun">
 &mdash; Macro: wint_t <b>WCHAR_MIN</b><var><a name="index-WCHAR_005fMIN-621"></a></var><br>
 <blockquote><p>The macro <code>WCHAR_MIN</code> evaluates to the minimum value representable
 by an object of type <code>wint_t</code>.

         <p>This macro was introduced in Amendment&nbsp;1<!-- /@w --> to ISO&nbsp;C90<!-- /@w -->.
 </p></blockquote></div>

 <!-- wchar.h -->
 <!-- ISO -->
 <div class="defun">
 &mdash; Macro: wint_t <b>WCHAR_MAX</b><var><a name="index-WCHAR_005fMAX-622"></a></var><br>
 <blockquote><p>The macro <code>WCHAR_MAX</code> evaluates to the maximum value representable
 by an object of type <code>wint_t</code>.

         <p>This macro was introduced in Amendment&nbsp;1<!-- /@w --> to ISO&nbsp;C90<!-- /@w -->.
 </p></blockquote></div>

    <p>Another special wide character value is the equivalent to <code>EOF</code>.

 <!-- wchar.h -->
 <!-- ISO -->
 <div class="defun">
 &mdash; Macro: wint_t <b>WEOF</b><var><a name="index-WEOF-623"></a></var><br>
 <blockquote><p>The macro <code>WEOF</code> evaluates to a constant expression of type
 <code>wint_t</code> whose value is different from any member of the extended
 character set.

         <p><code>WEOF</code> need not be the same value as <code>EOF</code> and unlike
 <code>EOF</code> it also need <em>not</em> be negative.  In other words, sloppy
 code like

      <pre class="smallexample">          {
             int c;
             ...
             while ((c = getc (fp)) &lt; 0)
               ...
           }
 </pre>
         <p class="noindent">has to be rewritten to use <code>WEOF</code> explicitly when wide characters
 are used:

      <pre class="smallexample">          {
             wint_t c;
             ...
             while ((c = wgetc (fp)) != WEOF)
               ...
           }
 </pre>
         <p><a name="index-wchar_002eh-624"></a>This macro was introduced in Amendment&nbsp;1<!-- /@w --> to ISO&nbsp;C90<!-- /@w --> and is
 defined in <samp><span class="file">wchar.h</span></samp>.
 </p></blockquote></div>

    <p>These internal representations present problems when it comes to storing
 and transmittal.  Because each single wide character consists of more
 than one byte, they are effected by byte-ordering.  Thus, machines with
 different endianesses would see different values when accessing the same
 data.  This byte ordering concern also applies for communication protocols
 that are all byte-based and therefore require that the sender has to
 decide about splitting the wide character in bytes.  A last (but not least
 important) point is that wide characters often require more storage space
 than a customized byte-oriented character set.

    <p><a name="index-multibyte-character-625"></a><a name="index-EBCDIC-626"></a>For all the above reasons, an external encoding that is different from
 the internal encoding is often used if the latter is UCS-2 or UCS-4.
 The external encoding is byte-based and can be chosen appropriately for
 the environment and for the texts to be handled.  A variety of different
 character sets can be used for this external encoding (information that
 will not be exhaustively presented here&ndash;instead, a description of the
 major groups will suffice).  All of the ASCII-based character sets
 fulfill one requirement: they are "filesystem safe."  This means that
 the character <code>'/'</code> is used in the encoding <em>only</em> to
 represent itself.  Things are a bit different for character sets like
 EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
 family used by IBM), but if the operation system does not understand
 EBCDIC directly the parameters-to-system calls have to be converted
 first anyhow.

      <ul>
 <li>The simplest character sets are single-byte character sets.  There can
 be only up to 256 characters (for 8&nbsp;bit<!-- /@w --> character sets), which is
 not sufficient to cover all languages but might be sufficient to handle
 a specific text.  Handling of a 8&nbsp;bit<!-- /@w --> character sets is simple.  This
 is not true for other kinds presented later, and therefore, the
 application one uses might require the use of 8&nbsp;bit<!-- /@w --> character sets.

      <p><a name="index-ISO-2022-627"></a><li>The ISO&nbsp;2022<!-- /@w --> standard defines a mechanism for extended character
 sets where one character <em>can</em> be represented by more than one
 byte.  This is achieved by associating a state with the text.
 Characters that can be used to change the state can be embedded in the
 text.  Each byte in the text might have a different interpretation in each
 state.  The state might even influence whether a given byte stands for a
 character on its own or whether it has to be combined with some more
 bytes.

      <p><a name="index-EUC-628"></a><a name="index-Shift_005fJIS-629"></a><a name="index-SJIS-630"></a>In most uses of ISO&nbsp;2022<!-- /@w --> the defined character sets do not allow
 state changes that cover more than the next character.  This has the
 big advantage that whenever one can identify the beginning of the byte
 sequence of a character one can interpret a text correctly.  Examples of
 character sets using this policy are the various EUC character sets
 (used by Sun's operations systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
 or Shift_JIS (SJIS, a Japanese encoding).

      <p>But there are also character sets using a state that is valid for more
 than one character and has to be changed by another byte sequence.
 Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.

      <li><a name="index-ISO-6937-631"></a>Early attempts to fix 8 bit character sets for other languages using the
 Roman alphabet lead to character sets like ISO&nbsp;6937<!-- /@w -->.  Here bytes
 representing characters like the acute accent do not produce output
 themselves: one has to combine them with other characters to get the
 desired result.  For example, the byte sequence <code>0xc2 0x61</code>
 (non-spacing acute accent, followed by lower-case `a') to get the &ldquo;small
 a with  acute&rdquo; character.  To get the acute accent character on its own,
 one has to write <code>0xc2 0x20</code> (the non-spacing acute followed by a
 space).

      <p>Character sets like ISO&nbsp;6937<!-- /@w --> are used in some embedded systems such
 as teletex.

      <li><a name="index-UTF_002d8-632"></a>Instead of converting the Unicode or ISO&nbsp;10646<!-- /@w --> text used internally,
 it is often also sufficient to simply use an encoding different than
 UCS-2/UCS-4.  The Unicode and ISO&nbsp;10646<!-- /@w --> standards even specify such an
 encoding: UTF-8.  This encoding is able to represent all of ISO&nbsp;10646<!-- /@w --> 31 bits in a byte string of length one to six.

      <p><a name="index-UTF_002d7-633"></a>There were a few other attempts to encode ISO&nbsp;10646<!-- /@w --> such as UTF-7,
 but UTF-8 is today the only encoding that should be used.  In fact, with
 any luck UTF-8 will soon be the only external encoding that has to be
 supported.  It proves to be universally usable and its only disadvantage
 is that it favors Roman languages by making the byte string
 representation of other scripts (Cyrillic, Greek, Asian scripts) longer
 than necessary if using a specific character set for these scripts.
 Methods like the Unicode compression scheme can alleviate these
 problems.
 </ul>

    <p>The question remaining is: how to select the character set or encoding
 to use.  The answer: you cannot decide about it yourself, it is decided
 by the developers of the system or the majority of the users.  Since the
 goal is interoperability one has to use whatever the other people one
 works with use.  If there are no constraints, the selection is based on
 the requirements the expected circle of users will have.  In other words,
 if a project is expected to be used in only, say, Russia it is fine to use
 KOI8-R or a similar character set.  But if at the same time people from,
 say, Greece are participating one should use a character set that allows
 all people to collaborate.

    <p>The most widely useful solution seems to be: go with the most general
 character set, namely ISO&nbsp;10646<!-- /@w -->.  Use UTF-8 as the external encoding
 and problems about users not being able to use their own language
 adequately are a thing of the past.

    <p>One final comment about the choice of the wide character representation
 is necessary at this point.  We have said above that the natural choice
 is using Unicode or ISO&nbsp;10646<!-- /@w -->.  This is not required, but at least
 encouraged, by the ISO&nbsp;C<!-- /@w --> standard.  The standard defines at least a
 macro <code>__STDC_ISO_10646__</code> that is only defined on systems where
 the <code>wchar_t</code> type encodes ISO&nbsp;10646<!-- /@w --> characters.  If this
 symbol is not defined one should avoid making assumptions about the wide
 character representation.  If the programmer uses only the functions
 provided by the C library to handle wide character strings there should
 be no compatibility problems with other systems.

    </body></html>
	<html lang="en">
	<head>
	<title>Extended Char Intro - The GNU C Library</title>
	<meta http-equiv="Content-Type" content="text/html">
	<meta name="description" content="The GNU C Library">
	<meta name="generator" content="makeinfo 4.13">
	<link title="Top" rel="start" href="index.html#Top">
	<link rel="up" href="Character-Set-Handling.html#Character-Set-Handling" title="Character Set Handling">
	<link rel="next" href="Charset-Function-Overview.html#Charset-Function-Overview" title="Charset Function Overview">
	<link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage">
	<!--
	This file documents the GNU C library.

	This is Edition 0.12, last updated 2007-10-27,
	of `The GNU C Library Reference Manual', for version
	2.8 (Sourcery G++ Lite 2011.03-41).

	Copyright (C) 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2001, 2002,
	2003, 2007, 2008, 2010 Free Software Foundation, Inc.

	Permission is granted to copy, distribute and/or modify this document
	under the terms of the GNU Free Documentation License, Version 1.3 or
	any later version published by the Free Software Foundation; with the
	Invariant Sections being ``Free Software Needs Free Documentation''
	and ``GNU Lesser General Public License'', the Front-Cover texts being
	``A GNU Manual'', and with the Back-Cover Texts as in (a) below. A
	copy of the license is included in the section entitled "GNU Free
	Documentation License".

	(a) The FSF's Back-Cover Text is: ``You have the freedom to
	copy and modify this GNU manual. Buying copies from the FSF
	supports it in developing GNU and promoting software freedom.''-->
	<meta http-equiv="Content-Style-Type" content="text/css">
	<style type="text/css"><!--
	pre.display { font-family:inherit }
	pre.format { font-family:inherit }
	pre.smalldisplay { font-family:inherit; font-size:smaller }
	pre.smallformat { font-family:inherit; font-size:smaller }
	pre.smallexample { font-size:smaller }
	pre.smalllisp { font-size:smaller }
	span.sc { font-variant:small-caps }
	span.roman { font-family:serif; font-weight:normal; }
	span.sansserif { font-family:sans-serif; font-weight:normal; }
	--></style>
	<link rel="stylesheet" type="text/css" href="../cs.css">
	</head>
	<body>
	<div class="node">
	<a name="Extended-Char-Intro"></a>
	<p>
	Next: <a rel="next" accesskey="n" href="Charset-Function-Overview.html#Charset-Function-Overview">Charset Function Overview</a>,
	Up: <a rel="up" accesskey="u" href="Character-Set-Handling.html#Character-Set-Handling">Character Set Handling</a>
	<hr>
	</div>

	<h3 class="section">6.1 Introduction to Extended Characters</h3>

	<p>A variety of solutions is available to overcome the differences between
	character sets with a 1:1 relation between bytes and characters and
	character sets with ratios of 2:1 or 4:1. The remainder of this
	section gives a few examples to help understand the design decisions
	made while developing the functionality of the C library<!-- /@w -->.

	<p><a name="index-internal-representation-610"></a>A distinction we have to make right away is between internal and
	external representation. <dfn>Internal representation</dfn> means the
	representation used by a program while keeping the text in memory.
	External representations are used when text is stored or transmitted
	through some communication channel. Examples of external
	representations include files waiting in a directory to be
	read and parsed.

	<p>Traditionally there has been no difference between the two representations.
	It was equally comfortable and useful to use the same single-byte
	representation internally and externally. This comfort level decreases
	with more and larger character sets.

	<p>One of the problems to overcome with the internal representation is
	handling text that is externally encoded using different character
	sets. Assume a program that reads two texts and compares them using
	some metric. The comparison can be usefully done only if the texts are
	internally kept in a common format.

	<p><a name="index-wide-character-611"></a>For such a common format (= character set) eight bits are certainly
	no longer enough. So the smallest entity will have to grow: <dfn>wide
	characters</dfn> will now be used. Instead of one byte per character, two or
	four will be used instead. (Three are not good to address in memory and
	more than four bytes seem not to be necessary).

	<p><a name="index-Unicode-612"></a><a name="index-ISO-10646-613"></a>As shown in some other part of this manual,
	<!-- !!! Ahem, wide char string functions are not yet covered - drepper -->
	a completely new family has been created of functions that can handle wide
	character texts in memory. The most commonly used character sets for such
	internal wide character representations are Unicode and ISO 10646<!-- /@w -->
	(also known as UCS for Universal Character Set). Unicode was originally
	planned as a 16-bit character set; whereas, ISO 10646<!-- /@w --> was designed to
	be a 31-bit large code space. The two standards are practically identical.
	They have the same character repertoire and code table, but Unicode specifies
	added semantics. At the moment, only characters in the first <code>0x10000</code>
	code positions (the so-called Basic Multilingual Plane, BMP) have been
	assigned, but the assignment of more specialized characters outside this
	16-bit space is already in progress. A number of encodings have been
	defined for Unicode and ISO 10646<!-- /@w --> characters:
	<a name="index-UCS_002d2-614"></a><a name="index-UCS_002d4-615"></a><a name="index-UTF_002d8-616"></a><a name="index-UTF_002d16-617"></a>UCS-2 is a 16-bit word that can only represent characters
	from the BMP, UCS-4 is a 32-bit word than can represent any Unicode
	and ISO 10646<!-- /@w --> character, UTF-8 is an ASCII compatible encoding where
	ASCII characters are represented by ASCII bytes and non-ASCII characters
	by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
	of UCS-2 in which pairs of certain UCS-2 words can be used to encode
	non-BMP characters up to <code>0x10ffff</code>.

	<p>To represent wide characters the <code>char</code> type is not suitable. For
	this reason the ISO C<!-- /@w --> standard introduces a new type that is
	designed to keep one character of a wide character string. To maintain
	the similarity there is also a type corresponding to <code>int</code> for
	those functions that take a single wide character.

	<!-- stddef.h -->
	<!-- ISO -->
	<div class="defun">
	— Data type: <b>wchar_t</b><var><a name="index-wchar_005ft-618"></a></var><br>
	<blockquote><p>This data type is used as the base type for wide character strings.
	In other words, arrays of objects of this type are the equivalent of
	<code>char[]</code> for multibyte character strings. The type is defined in
	<samp><span class="file">stddef.h</span></samp>.

	<p>The ISO C90<!-- /@w --> standard, where <code>wchar_t</code> was introduced, does not
	say anything specific about the representation. It only requires that
	this type is capable of storing all elements of the basic character set.
	Therefore it would be legitimate to define <code>wchar_t</code> as <code>char</code>,
	which might make sense for embedded systems.

	<p>But for GNU systems <code>wchar_t</code> is always 32 bits wide and, therefore,
	capable of representing all UCS-4 values and, therefore, covering all of
	ISO 10646<!-- /@w -->. Some Unix systems define <code>wchar_t</code> as a 16-bit type
	and thereby follow Unicode very strictly. This definition is perfectly
	fine with the standard, but it also means that to represent all
	characters from Unicode and ISO 10646<!-- /@w --> one has to use UTF-16 surrogate
	characters, which is in fact a multi-wide-character encoding. But
	resorting to multi-wide-character encoding contradicts the purpose of the
	<code>wchar_t</code> type.
	</p></blockquote></div>

	<!-- wchar.h -->
	<!-- ISO -->
	<div class="defun">
	— Data type: <b>wint_t</b><var><a name="index-wint_005ft-619"></a></var><br>
	<blockquote><p><code>wint_t</code> is a data type used for parameters and variables that
	contain a single wide character. As the name suggests this type is the
	equivalent of <code>int</code> when using the normal <code>char</code> strings. The
	types <code>wchar_t</code> and <code>wint_t</code> often have the same
	representation if their size is 32 bits wide but if <code>wchar_t</code> is
	defined as <code>char</code> the type <code>wint_t</code> must be defined as
	<code>int</code> due to the parameter promotion.

	<p><a name="index-wchar_002eh-620"></a>This type is defined in <samp><span class="file">wchar.h</span></samp> and was introduced in
	Amendment 1<!-- /@w --> to ISO C90<!-- /@w -->.
	</p></blockquote></div>

	<p>As there are for the <code>char</code> data type macros are available for
	specifying the minimum and maximum value representable in an object of
	type <code>wchar_t</code>.

	<!-- wchar.h -->
	<!-- ISO -->
	<div class="defun">
	— Macro: wint_t <b>WCHAR_MIN</b><var><a name="index-WCHAR_005fMIN-621"></a></var><br>
	<blockquote><p>The macro <code>WCHAR_MIN</code> evaluates to the minimum value representable
	by an object of type <code>wint_t</code>.

	<p>This macro was introduced in Amendment 1<!-- /@w --> to ISO C90<!-- /@w -->.
	</p></blockquote></div>

	<!-- wchar.h -->
	<!-- ISO -->
	<div class="defun">
	— Macro: wint_t <b>WCHAR_MAX</b><var><a name="index-WCHAR_005fMAX-622"></a></var><br>
	<blockquote><p>The macro <code>WCHAR_MAX</code> evaluates to the maximum value representable
	by an object of type <code>wint_t</code>.

	<p>This macro was introduced in Amendment 1<!-- /@w --> to ISO C90<!-- /@w -->.
	</p></blockquote></div>

	<p>Another special wide character value is the equivalent to <code>EOF</code>.

	<!-- wchar.h -->
	<!-- ISO -->
	<div class="defun">
	— Macro: wint_t <b>WEOF</b><var><a name="index-WEOF-623"></a></var><br>
	<blockquote><p>The macro <code>WEOF</code> evaluates to a constant expression of type
	<code>wint_t</code> whose value is different from any member of the extended
	character set.

	<p><code>WEOF</code> need not be the same value as <code>EOF</code> and unlike
	<code>EOF</code> it also need <em>not</em> be negative. In other words, sloppy
	code like

	<pre class="smallexample"> {
	int c;
	...
	while ((c = getc (fp)) < 0)
	...
	}
	</pre>
	<p class="noindent">has to be rewritten to use <code>WEOF</code> explicitly when wide characters
	are used:

	<pre class="smallexample"> {
	wint_t c;
	...
	while ((c = wgetc (fp)) != WEOF)
	...
	}
	</pre>
	<p><a name="index-wchar_002eh-624"></a>This macro was introduced in Amendment 1<!-- /@w --> to ISO C90<!-- /@w --> and is
	defined in <samp><span class="file">wchar.h</span></samp>.
	</p></blockquote></div>

	<p>These internal representations present problems when it comes to storing
	and transmittal. Because each single wide character consists of more
	than one byte, they are effected by byte-ordering. Thus, machines with
	different endianesses would see different values when accessing the same
	data. This byte ordering concern also applies for communication protocols
	that are all byte-based and therefore require that the sender has to
	decide about splitting the wide character in bytes. A last (but not least
	important) point is that wide characters often require more storage space
	than a customized byte-oriented character set.

	<p><a name="index-multibyte-character-625"></a><a name="index-EBCDIC-626"></a>For all the above reasons, an external encoding that is different from
	the internal encoding is often used if the latter is UCS-2 or UCS-4.
	The external encoding is byte-based and can be chosen appropriately for
	the environment and for the texts to be handled. A variety of different
	character sets can be used for this external encoding (information that
	will not be exhaustively presented here–instead, a description of the
	major groups will suffice). All of the ASCII-based character sets
	fulfill one requirement: they are "filesystem safe." This means that
	the character <code>'/'</code> is used in the encoding <em>only</em> to
	represent itself. Things are a bit different for character sets like
	EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
	family used by IBM), but if the operation system does not understand
	EBCDIC directly the parameters-to-system calls have to be converted
	first anyhow.

	<ul>
	<li>The simplest character sets are single-byte character sets. There can
	be only up to 256 characters (for 8 bit<!-- /@w --> character sets), which is
	not sufficient to cover all languages but might be sufficient to handle
	a specific text. Handling of a 8 bit<!-- /@w --> character sets is simple. This
	is not true for other kinds presented later, and therefore, the
	application one uses might require the use of 8 bit<!-- /@w --> character sets.

	<p><a name="index-ISO-2022-627"></a><li>The ISO 2022<!-- /@w --> standard defines a mechanism for extended character
	sets where one character <em>can</em> be represented by more than one
	byte. This is achieved by associating a state with the text.
	Characters that can be used to change the state can be embedded in the
	text. Each byte in the text might have a different interpretation in each
	state. The state might even influence whether a given byte stands for a
	character on its own or whether it has to be combined with some more
	bytes.

	<p><a name="index-EUC-628"></a><a name="index-Shift_005fJIS-629"></a><a name="index-SJIS-630"></a>In most uses of ISO 2022<!-- /@w --> the defined character sets do not allow
	state changes that cover more than the next character. This has the
	big advantage that whenever one can identify the beginning of the byte
	sequence of a character one can interpret a text correctly. Examples of
	character sets using this policy are the various EUC character sets
	(used by Sun's operations systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
	or Shift_JIS (SJIS, a Japanese encoding).

	<p>But there are also character sets using a state that is valid for more
	than one character and has to be changed by another byte sequence.
	Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.

	<li><a name="index-ISO-6937-631"></a>Early attempts to fix 8 bit character sets for other languages using the
	Roman alphabet lead to character sets like ISO 6937<!-- /@w -->. Here bytes
	representing characters like the acute accent do not produce output
	themselves: one has to combine them with other characters to get the
	desired result. For example, the byte sequence <code>0xc2 0x61</code>
	(non-spacing acute accent, followed by lower-case `a') to get the “small
	a with acute” character. To get the acute accent character on its own,
	one has to write <code>0xc2 0x20</code> (the non-spacing acute followed by a
	space).

	<p>Character sets like ISO 6937<!-- /@w --> are used in some embedded systems such
	as teletex.

	<li><a name="index-UTF_002d8-632"></a>Instead of converting the Unicode or ISO 10646<!-- /@w --> text used internally,
	it is often also sufficient to simply use an encoding different than
	UCS-2/UCS-4. The Unicode and ISO 10646<!-- /@w --> standards even specify such an
	encoding: UTF-8. This encoding is able to represent all of ISO 10646<!-- /@w --> 31 bits in a byte string of length one to six.

	<p><a name="index-UTF_002d7-633"></a>There were a few other attempts to encode ISO 10646<!-- /@w --> such as UTF-7,
	but UTF-8 is today the only encoding that should be used. In fact, with
	any luck UTF-8 will soon be the only external encoding that has to be
	supported. It proves to be universally usable and its only disadvantage
	is that it favors Roman languages by making the byte string
	representation of other scripts (Cyrillic, Greek, Asian scripts) longer
	than necessary if using a specific character set for these scripts.
	Methods like the Unicode compression scheme can alleviate these
	problems.
	</ul>

	<p>The question remaining is: how to select the character set or encoding
	to use. The answer: you cannot decide about it yourself, it is decided
	by the developers of the system or the majority of the users. Since the
	goal is interoperability one has to use whatever the other people one
	works with use. If there are no constraints, the selection is based on
	the requirements the expected circle of users will have. In other words,
	if a project is expected to be used in only, say, Russia it is fine to use
	KOI8-R or a similar character set. But if at the same time people from,
	say, Greece are participating one should use a character set that allows
	all people to collaborate.

	<p>The most widely useful solution seems to be: go with the most general
	character set, namely ISO 10646<!-- /@w -->. Use UTF-8 as the external encoding
	and problems about users not being able to use their own language
	adequately are a thing of the past.

	<p>One final comment about the choice of the wide character representation
	is necessary at this point. We have said above that the natural choice
	is using Unicode or ISO 10646<!-- /@w -->. This is not required, but at least
	encouraged, by the ISO C<!-- /@w --> standard. The standard defines at least a
	macro <code>__STDC_ISO_10646__</code> that is only defined on systems where
	the <code>wchar_t</code> type encodes ISO 10646<!-- /@w --> characters. If this
	symbol is not defined one should avoid making assumptions about the wide
	character representation. If the programmer uses only the functions
	provided by the C library to handle wide character strings there should
	be no compatibility problems with other systems.

	</body></html>