| <html lang="en"> |
| <head> |
| <title>Representation of Strings - The GNU C Library</title> |
| <meta http-equiv="Content-Type" content="text/html"> |
| <meta name="description" content="The GNU C Library"> |
| <meta name="generator" content="makeinfo 4.13"> |
| <link title="Top" rel="start" href="index.html#Top"> |
| <link rel="up" href="String-and-Array-Utilities.html#String-and-Array-Utilities" title="String and Array Utilities"> |
| <link rel="next" href="String_002fArray-Conventions.html#String_002fArray-Conventions" title="String/Array Conventions"> |
| <link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage"> |
| <!-- |
| This file documents the GNU C library. |
| |
| This is Edition 0.12, last updated 2007-10-27, |
| of `The GNU C Library Reference Manual', for version |
| 2.8 (Sourcery G++ Lite 2011.03-41). |
| |
| Copyright (C) 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2001, 2002, |
| 2003, 2007, 2008, 2010 Free Software Foundation, Inc. |
| |
| Permission is granted to copy, distribute and/or modify this document |
| under the terms of the GNU Free Documentation License, Version 1.3 or |
| any later version published by the Free Software Foundation; with the |
| Invariant Sections being ``Free Software Needs Free Documentation'' |
| and ``GNU Lesser General Public License'', the Front-Cover texts being |
| ``A GNU Manual'', and with the Back-Cover Texts as in (a) below. A |
| copy of the license is included in the section entitled "GNU Free |
| Documentation License". |
| |
| (a) The FSF's Back-Cover Text is: ``You have the freedom to |
| copy and modify this GNU manual. Buying copies from the FSF |
| supports it in developing GNU and promoting software freedom.''--> |
| <meta http-equiv="Content-Style-Type" content="text/css"> |
| <style type="text/css"><!-- |
| pre.display { font-family:inherit } |
| pre.format { font-family:inherit } |
| pre.smalldisplay { font-family:inherit; font-size:smaller } |
| pre.smallformat { font-family:inherit; font-size:smaller } |
| pre.smallexample { font-size:smaller } |
| pre.smalllisp { font-size:smaller } |
| span.sc { font-variant:small-caps } |
| span.roman { font-family:serif; font-weight:normal; } |
| span.sansserif { font-family:sans-serif; font-weight:normal; } |
| --></style> |
| <link rel="stylesheet" type="text/css" href="../cs.css"> |
| </head> |
| <body> |
| <div class="node"> |
| <a name="Representation-of-Strings"></a> |
| <p> |
| Next: <a rel="next" accesskey="n" href="String_002fArray-Conventions.html#String_002fArray-Conventions">String/Array Conventions</a>, |
| Up: <a rel="up" accesskey="u" href="String-and-Array-Utilities.html#String-and-Array-Utilities">String and Array Utilities</a> |
| <hr> |
| </div> |
| |
| <h3 class="section">5.1 Representation of Strings</h3> |
| |
| <p><a name="index-string_002c-representation-of-456"></a> |
| This section is a quick summary of string concepts for beginning C |
| programmers. It describes how character strings are represented in C |
| and some common pitfalls. If you are already familiar with this |
| material, you can skip this section. |
| |
| <p><a name="index-string-457"></a><a name="index-multibyte-character-string-458"></a>A <dfn>string</dfn> is an array of <code>char</code> objects. But string-valued |
| variables are usually declared to be pointers of type <code>char *</code>. |
| Such variables do not include space for the text of a string; that has |
| to be stored somewhere else—in an array variable, a string constant, |
| or dynamically allocated memory (see <a href="Memory-Allocation.html#Memory-Allocation">Memory Allocation</a>). It's up to |
| you to store the address of the chosen memory space into the pointer |
| variable. Alternatively you can store a <dfn>null pointer</dfn> in the |
| pointer variable. The null pointer does not point anywhere, so |
| attempting to reference the string it points to gets an error. |
| |
| <p><a name="index-wide-character-string-459"></a>“string” normally refers to multibyte character strings as opposed to |
| wide character strings. Wide character strings are arrays of type |
| <code>wchar_t</code> and as for multibyte character strings usually pointers |
| of type <code>wchar_t *</code> are used. |
| |
| <p><a name="index-null-character-460"></a><a name="index-null-wide-character-461"></a>By convention, a <dfn>null character</dfn>, <code>'\0'</code>, marks the end of a |
| multibyte character string and the <dfn>null wide character</dfn>, |
| <code>L'\0'</code>, marks the end of a wide character string. For example, in |
| testing to see whether the <code>char *</code> variable <var>p</var> points to a |
| null character marking the end of a string, you can write |
| <code>!*</code><var>p</var> or <code>*</code><var>p</var><code> == '\0'</code>. |
| |
| <p>A null character is quite different conceptually from a null pointer, |
| although both are represented by the integer <code>0</code>. |
| |
| <p><a name="index-string-literal-462"></a><dfn>String literals</dfn> appear in C program source as strings of |
| characters between double-quote characters (‘<samp><span class="samp">"</span></samp>’) where the initial |
| double-quote character is immediately preceded by a capital ‘<samp><span class="samp">L</span></samp>’ |
| (ell) character (as in <code>L"foo"</code>). In ISO C<!-- /@w -->, string literals |
| can also be formed by <dfn>string concatenation</dfn>: <code>"a" "b"</code> is the |
| same as <code>"ab"</code>. For wide character strings one can either use |
| <code>L"a" L"b"</code> or <code>L"a" "b"</code>. Modification of string literals is |
| not allowed by the GNU C compiler, because literals are placed in |
| read-only storage. |
| |
| <p>Character arrays that are declared <code>const</code> cannot be modified |
| either. It's generally good style to declare non-modifiable string |
| pointers to be of type <code>const char *</code>, since this often allows the |
| C compiler to detect accidental modifications as well as providing some |
| amount of documentation about what your program intends to do with the |
| string. |
| |
| <p>The amount of memory allocated for the character array may extend past |
| the null character that normally marks the end of the string. In this |
| document, the term <dfn>allocated size</dfn> is always used to refer to the |
| total amount of memory allocated for the string, while the term |
| <dfn>length</dfn> refers to the number of characters up to (but not |
| including) the terminating null character. |
| <a name="index-length-of-string-463"></a><a name="index-allocation-size-of-string-464"></a><a name="index-size-of-string-465"></a><a name="index-string-length-466"></a><a name="index-string-allocation-467"></a> |
| A notorious source of program bugs is trying to put more characters in a |
| string than fit in its allocated size. When writing code that extends |
| strings or moves characters into a pre-allocated array, you should be |
| very careful to keep track of the length of the text and make explicit |
| checks for overflowing the array. Many of the library functions |
| <em>do not</em> do this for you! Remember also that you need to allocate |
| an extra byte to hold the null character that marks the end of the |
| string. |
| |
| <p><a name="index-single_002dbyte-string-468"></a><a name="index-multibyte-string-469"></a>Originally strings were sequences of bytes where each byte represents a |
| single character. This is still true today if the strings are encoded |
| using a single-byte character encoding. Things are different if the |
| strings are encoded using a multibyte encoding (for more information on |
| encodings see <a href="Extended-Char-Intro.html#Extended-Char-Intro">Extended Char Intro</a>). There is no difference in |
| the programming interface for these two kind of strings; the programmer |
| has to be aware of this and interpret the byte sequences accordingly. |
| |
| <p>But since there is no separate interface taking care of these |
| differences the byte-based string functions are sometimes hard to use. |
| Since the count parameters of these functions specify bytes a call to |
| <code>strncpy</code> could cut a multibyte character in the middle and put an |
| incomplete (and therefore unusable) byte sequence in the target buffer. |
| |
| <p><a name="index-wide-character-string-470"></a>To avoid these problems later versions of the ISO C<!-- /@w --> standard |
| introduce a second set of functions which are operating on <dfn>wide |
| characters</dfn> (see <a href="Extended-Char-Intro.html#Extended-Char-Intro">Extended Char Intro</a>). These functions don't have |
| the problems the single-byte versions have since every wide character is |
| a legal, interpretable value. This does not mean that cutting wide |
| character strings at arbitrary points is without problems. It normally |
| is for alphabet-based languages (except for non-normalized text) but |
| languages based on syllables still have the problem that more than one |
| wide character is necessary to complete a logical unit. This is a |
| higher level problem which the C library<!-- /@w --> functions are not designed |
| to solve. But it is at least good that no invalid byte sequences can be |
| created. Also, the higher level functions can also much easier operate |
| on wide character than on multibyte characters so that a general advise |
| is to use wide characters internally whenever text is more than simply |
| copied. |
| |
| <p>The remaining of this chapter will discuss the functions for handling |
| wide character strings in parallel with the discussion of the multibyte |
| character strings since there is almost always an exact equivalent |
| available. |
| |
| </body></html> |
| |