| <html lang="en"> |
| <head> |
| <title>Extended Char Intro - The GNU C Library</title> |
| <meta http-equiv="Content-Type" content="text/html"> |
| <meta name="description" content="The GNU C Library"> |
| <meta name="generator" content="makeinfo 4.13"> |
| <link title="Top" rel="start" href="index.html#Top"> |
| <link rel="up" href="Character-Set-Handling.html#Character-Set-Handling" title="Character Set Handling"> |
| <link rel="next" href="Charset-Function-Overview.html#Charset-Function-Overview" title="Charset Function Overview"> |
| <link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage"> |
| <!-- |
| This file documents the GNU C library. |
| |
| This is Edition 0.12, last updated 2007-10-27, |
| of `The GNU C Library Reference Manual', for version |
| 2.8 (Sourcery G++ Lite 2011.03-41). |
| |
| Copyright (C) 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2001, 2002, |
| 2003, 2007, 2008, 2010 Free Software Foundation, Inc. |
| |
| Permission is granted to copy, distribute and/or modify this document |
| under the terms of the GNU Free Documentation License, Version 1.3 or |
| any later version published by the Free Software Foundation; with the |
| Invariant Sections being ``Free Software Needs Free Documentation'' |
| and ``GNU Lesser General Public License'', the Front-Cover texts being |
| ``A GNU Manual'', and with the Back-Cover Texts as in (a) below. A |
| copy of the license is included in the section entitled "GNU Free |
| Documentation License". |
| |
| (a) The FSF's Back-Cover Text is: ``You have the freedom to |
| copy and modify this GNU manual. Buying copies from the FSF |
| supports it in developing GNU and promoting software freedom.''--> |
| <meta http-equiv="Content-Style-Type" content="text/css"> |
| <style type="text/css"><!-- |
| pre.display { font-family:inherit } |
| pre.format { font-family:inherit } |
| pre.smalldisplay { font-family:inherit; font-size:smaller } |
| pre.smallformat { font-family:inherit; font-size:smaller } |
| pre.smallexample { font-size:smaller } |
| pre.smalllisp { font-size:smaller } |
| span.sc { font-variant:small-caps } |
| span.roman { font-family:serif; font-weight:normal; } |
| span.sansserif { font-family:sans-serif; font-weight:normal; } |
| --></style> |
| <link rel="stylesheet" type="text/css" href="../cs.css"> |
| </head> |
| <body> |
| <div class="node"> |
| <a name="Extended-Char-Intro"></a> |
| <p> |
| Next: <a rel="next" accesskey="n" href="Charset-Function-Overview.html#Charset-Function-Overview">Charset Function Overview</a>, |
| Up: <a rel="up" accesskey="u" href="Character-Set-Handling.html#Character-Set-Handling">Character Set Handling</a> |
| <hr> |
| </div> |
| |
| <h3 class="section">6.1 Introduction to Extended Characters</h3> |
| |
| <p>A variety of solutions is available to overcome the differences between |
| character sets with a 1:1 relation between bytes and characters and |
| character sets with ratios of 2:1 or 4:1. The remainder of this |
| section gives a few examples to help understand the design decisions |
| made while developing the functionality of the C library<!-- /@w -->. |
| |
| <p><a name="index-internal-representation-610"></a>A distinction we have to make right away is between internal and |
| external representation. <dfn>Internal representation</dfn> means the |
| representation used by a program while keeping the text in memory. |
| External representations are used when text is stored or transmitted |
| through some communication channel. Examples of external |
| representations include files waiting in a directory to be |
| read and parsed. |
| |
| <p>Traditionally there has been no difference between the two representations. |
| It was equally comfortable and useful to use the same single-byte |
| representation internally and externally. This comfort level decreases |
| with more and larger character sets. |
| |
| <p>One of the problems to overcome with the internal representation is |
| handling text that is externally encoded using different character |
| sets. Assume a program that reads two texts and compares them using |
| some metric. The comparison can be usefully done only if the texts are |
| internally kept in a common format. |
| |
| <p><a name="index-wide-character-611"></a>For such a common format (= character set) eight bits are certainly |
| no longer enough. So the smallest entity will have to grow: <dfn>wide |
| characters</dfn> will now be used. Instead of one byte per character, two or |
| four will be used instead. (Three are not good to address in memory and |
| more than four bytes seem not to be necessary). |
| |
| <p><a name="index-Unicode-612"></a><a name="index-ISO-10646-613"></a>As shown in some other part of this manual, |
| <!-- !!! Ahem, wide char string functions are not yet covered - drepper --> |
| a completely new family has been created of functions that can handle wide |
| character texts in memory. The most commonly used character sets for such |
| internal wide character representations are Unicode and ISO 10646<!-- /@w --> |
| (also known as UCS for Universal Character Set). Unicode was originally |
| planned as a 16-bit character set; whereas, ISO 10646<!-- /@w --> was designed to |
| be a 31-bit large code space. The two standards are practically identical. |
| They have the same character repertoire and code table, but Unicode specifies |
| added semantics. At the moment, only characters in the first <code>0x10000</code> |
| code positions (the so-called Basic Multilingual Plane, BMP) have been |
| assigned, but the assignment of more specialized characters outside this |
| 16-bit space is already in progress. A number of encodings have been |
| defined for Unicode and ISO 10646<!-- /@w --> characters: |
| <a name="index-UCS_002d2-614"></a><a name="index-UCS_002d4-615"></a><a name="index-UTF_002d8-616"></a><a name="index-UTF_002d16-617"></a>UCS-2 is a 16-bit word that can only represent characters |
| from the BMP, UCS-4 is a 32-bit word than can represent any Unicode |
| and ISO 10646<!-- /@w --> character, UTF-8 is an ASCII compatible encoding where |
| ASCII characters are represented by ASCII bytes and non-ASCII characters |
| by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension |
| of UCS-2 in which pairs of certain UCS-2 words can be used to encode |
| non-BMP characters up to <code>0x10ffff</code>. |
| |
| <p>To represent wide characters the <code>char</code> type is not suitable. For |
| this reason the ISO C<!-- /@w --> standard introduces a new type that is |
| designed to keep one character of a wide character string. To maintain |
| the similarity there is also a type corresponding to <code>int</code> for |
| those functions that take a single wide character. |
| |
| <!-- stddef.h --> |
| <!-- ISO --> |
| <div class="defun"> |
| — Data type: <b>wchar_t</b><var><a name="index-wchar_005ft-618"></a></var><br> |
| <blockquote><p>This data type is used as the base type for wide character strings. |
| In other words, arrays of objects of this type are the equivalent of |
| <code>char[]</code> for multibyte character strings. The type is defined in |
| <samp><span class="file">stddef.h</span></samp>. |
| |
| <p>The ISO C90<!-- /@w --> standard, where <code>wchar_t</code> was introduced, does not |
| say anything specific about the representation. It only requires that |
| this type is capable of storing all elements of the basic character set. |
| Therefore it would be legitimate to define <code>wchar_t</code> as <code>char</code>, |
| which might make sense for embedded systems. |
| |
| <p>But for GNU systems <code>wchar_t</code> is always 32 bits wide and, therefore, |
| capable of representing all UCS-4 values and, therefore, covering all of |
| ISO 10646<!-- /@w -->. Some Unix systems define <code>wchar_t</code> as a 16-bit type |
| and thereby follow Unicode very strictly. This definition is perfectly |
| fine with the standard, but it also means that to represent all |
| characters from Unicode and ISO 10646<!-- /@w --> one has to use UTF-16 surrogate |
| characters, which is in fact a multi-wide-character encoding. But |
| resorting to multi-wide-character encoding contradicts the purpose of the |
| <code>wchar_t</code> type. |
| </p></blockquote></div> |
| |
| <!-- wchar.h --> |
| <!-- ISO --> |
| <div class="defun"> |
| — Data type: <b>wint_t</b><var><a name="index-wint_005ft-619"></a></var><br> |
| <blockquote><p><code>wint_t</code> is a data type used for parameters and variables that |
| contain a single wide character. As the name suggests this type is the |
| equivalent of <code>int</code> when using the normal <code>char</code> strings. The |
| types <code>wchar_t</code> and <code>wint_t</code> often have the same |
| representation if their size is 32 bits wide but if <code>wchar_t</code> is |
| defined as <code>char</code> the type <code>wint_t</code> must be defined as |
| <code>int</code> due to the parameter promotion. |
| |
| <p><a name="index-wchar_002eh-620"></a>This type is defined in <samp><span class="file">wchar.h</span></samp> and was introduced in |
| Amendment 1<!-- /@w --> to ISO C90<!-- /@w -->. |
| </p></blockquote></div> |
| |
| <p>As there are for the <code>char</code> data type macros are available for |
| specifying the minimum and maximum value representable in an object of |
| type <code>wchar_t</code>. |
| |
| <!-- wchar.h --> |
| <!-- ISO --> |
| <div class="defun"> |
| — Macro: wint_t <b>WCHAR_MIN</b><var><a name="index-WCHAR_005fMIN-621"></a></var><br> |
| <blockquote><p>The macro <code>WCHAR_MIN</code> evaluates to the minimum value representable |
| by an object of type <code>wint_t</code>. |
| |
| <p>This macro was introduced in Amendment 1<!-- /@w --> to ISO C90<!-- /@w -->. |
| </p></blockquote></div> |
| |
| <!-- wchar.h --> |
| <!-- ISO --> |
| <div class="defun"> |
| — Macro: wint_t <b>WCHAR_MAX</b><var><a name="index-WCHAR_005fMAX-622"></a></var><br> |
| <blockquote><p>The macro <code>WCHAR_MAX</code> evaluates to the maximum value representable |
| by an object of type <code>wint_t</code>. |
| |
| <p>This macro was introduced in Amendment 1<!-- /@w --> to ISO C90<!-- /@w -->. |
| </p></blockquote></div> |
| |
| <p>Another special wide character value is the equivalent to <code>EOF</code>. |
| |
| <!-- wchar.h --> |
| <!-- ISO --> |
| <div class="defun"> |
| — Macro: wint_t <b>WEOF</b><var><a name="index-WEOF-623"></a></var><br> |
| <blockquote><p>The macro <code>WEOF</code> evaluates to a constant expression of type |
| <code>wint_t</code> whose value is different from any member of the extended |
| character set. |
| |
| <p><code>WEOF</code> need not be the same value as <code>EOF</code> and unlike |
| <code>EOF</code> it also need <em>not</em> be negative. In other words, sloppy |
| code like |
| |
| <pre class="smallexample"> { |
| int c; |
| ... |
| while ((c = getc (fp)) < 0) |
| ... |
| } |
| </pre> |
| <p class="noindent">has to be rewritten to use <code>WEOF</code> explicitly when wide characters |
| are used: |
| |
| <pre class="smallexample"> { |
| wint_t c; |
| ... |
| while ((c = wgetc (fp)) != WEOF) |
| ... |
| } |
| </pre> |
| <p><a name="index-wchar_002eh-624"></a>This macro was introduced in Amendment 1<!-- /@w --> to ISO C90<!-- /@w --> and is |
| defined in <samp><span class="file">wchar.h</span></samp>. |
| </p></blockquote></div> |
| |
| <p>These internal representations present problems when it comes to storing |
| and transmittal. Because each single wide character consists of more |
| than one byte, they are effected by byte-ordering. Thus, machines with |
| different endianesses would see different values when accessing the same |
| data. This byte ordering concern also applies for communication protocols |
| that are all byte-based and therefore require that the sender has to |
| decide about splitting the wide character in bytes. A last (but not least |
| important) point is that wide characters often require more storage space |
| than a customized byte-oriented character set. |
| |
| <p><a name="index-multibyte-character-625"></a><a name="index-EBCDIC-626"></a>For all the above reasons, an external encoding that is different from |
| the internal encoding is often used if the latter is UCS-2 or UCS-4. |
| The external encoding is byte-based and can be chosen appropriately for |
| the environment and for the texts to be handled. A variety of different |
| character sets can be used for this external encoding (information that |
| will not be exhaustively presented here–instead, a description of the |
| major groups will suffice). All of the ASCII-based character sets |
| fulfill one requirement: they are "filesystem safe." This means that |
| the character <code>'/'</code> is used in the encoding <em>only</em> to |
| represent itself. Things are a bit different for character sets like |
| EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set |
| family used by IBM), but if the operation system does not understand |
| EBCDIC directly the parameters-to-system calls have to be converted |
| first anyhow. |
| |
| <ul> |
| <li>The simplest character sets are single-byte character sets. There can |
| be only up to 256 characters (for 8 bit<!-- /@w --> character sets), which is |
| not sufficient to cover all languages but might be sufficient to handle |
| a specific text. Handling of a 8 bit<!-- /@w --> character sets is simple. This |
| is not true for other kinds presented later, and therefore, the |
| application one uses might require the use of 8 bit<!-- /@w --> character sets. |
| |
| <p><a name="index-ISO-2022-627"></a><li>The ISO 2022<!-- /@w --> standard defines a mechanism for extended character |
| sets where one character <em>can</em> be represented by more than one |
| byte. This is achieved by associating a state with the text. |
| Characters that can be used to change the state can be embedded in the |
| text. Each byte in the text might have a different interpretation in each |
| state. The state might even influence whether a given byte stands for a |
| character on its own or whether it has to be combined with some more |
| bytes. |
| |
| <p><a name="index-EUC-628"></a><a name="index-Shift_005fJIS-629"></a><a name="index-SJIS-630"></a>In most uses of ISO 2022<!-- /@w --> the defined character sets do not allow |
| state changes that cover more than the next character. This has the |
| big advantage that whenever one can identify the beginning of the byte |
| sequence of a character one can interpret a text correctly. Examples of |
| character sets using this policy are the various EUC character sets |
| (used by Sun's operations systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN) |
| or Shift_JIS (SJIS, a Japanese encoding). |
| |
| <p>But there are also character sets using a state that is valid for more |
| than one character and has to be changed by another byte sequence. |
| Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN. |
| |
| <li><a name="index-ISO-6937-631"></a>Early attempts to fix 8 bit character sets for other languages using the |
| Roman alphabet lead to character sets like ISO 6937<!-- /@w -->. Here bytes |
| representing characters like the acute accent do not produce output |
| themselves: one has to combine them with other characters to get the |
| desired result. For example, the byte sequence <code>0xc2 0x61</code> |
| (non-spacing acute accent, followed by lower-case `a') to get the “small |
| a with acute” character. To get the acute accent character on its own, |
| one has to write <code>0xc2 0x20</code> (the non-spacing acute followed by a |
| space). |
| |
| <p>Character sets like ISO 6937<!-- /@w --> are used in some embedded systems such |
| as teletex. |
| |
| <li><a name="index-UTF_002d8-632"></a>Instead of converting the Unicode or ISO 10646<!-- /@w --> text used internally, |
| it is often also sufficient to simply use an encoding different than |
| UCS-2/UCS-4. The Unicode and ISO 10646<!-- /@w --> standards even specify such an |
| encoding: UTF-8. This encoding is able to represent all of ISO 10646<!-- /@w --> 31 bits in a byte string of length one to six. |
| |
| <p><a name="index-UTF_002d7-633"></a>There were a few other attempts to encode ISO 10646<!-- /@w --> such as UTF-7, |
| but UTF-8 is today the only encoding that should be used. In fact, with |
| any luck UTF-8 will soon be the only external encoding that has to be |
| supported. It proves to be universally usable and its only disadvantage |
| is that it favors Roman languages by making the byte string |
| representation of other scripts (Cyrillic, Greek, Asian scripts) longer |
| than necessary if using a specific character set for these scripts. |
| Methods like the Unicode compression scheme can alleviate these |
| problems. |
| </ul> |
| |
| <p>The question remaining is: how to select the character set or encoding |
| to use. The answer: you cannot decide about it yourself, it is decided |
| by the developers of the system or the majority of the users. Since the |
| goal is interoperability one has to use whatever the other people one |
| works with use. If there are no constraints, the selection is based on |
| the requirements the expected circle of users will have. In other words, |
| if a project is expected to be used in only, say, Russia it is fine to use |
| KOI8-R or a similar character set. But if at the same time people from, |
| say, Greece are participating one should use a character set that allows |
| all people to collaborate. |
| |
| <p>The most widely useful solution seems to be: go with the most general |
| character set, namely ISO 10646<!-- /@w -->. Use UTF-8 as the external encoding |
| and problems about users not being able to use their own language |
| adequately are a thing of the past. |
| |
| <p>One final comment about the choice of the wide character representation |
| is necessary at this point. We have said above that the natural choice |
| is using Unicode or ISO 10646<!-- /@w -->. This is not required, but at least |
| encouraged, by the ISO C<!-- /@w --> standard. The standard defines at least a |
| macro <code>__STDC_ISO_10646__</code> that is only defined on systems where |
| the <code>wchar_t</code> type encodes ISO 10646<!-- /@w --> characters. If this |
| symbol is not defined one should avoid making assumptions about the wide |
| character representation. If the programmer uses only the functions |
| provided by the C library to handle wide character strings there should |
| be no compatibility problems with other systems. |
| |
| </body></html> |
| |