| <html lang="en"> |
| <head> |
| <title>Converting Strings - The GNU C Library</title> |
| <meta http-equiv="Content-Type" content="text/html"> |
| <meta name="description" content="The GNU C Library"> |
| <meta name="generator" content="makeinfo 4.13"> |
| <link title="Top" rel="start" href="index.html#Top"> |
| <link rel="up" href="Restartable-multibyte-conversion.html#Restartable-multibyte-conversion" title="Restartable multibyte conversion"> |
| <link rel="prev" href="Converting-a-Character.html#Converting-a-Character" title="Converting a Character"> |
| <link rel="next" href="Multibyte-Conversion-Example.html#Multibyte-Conversion-Example" title="Multibyte Conversion Example"> |
| <link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage"> |
| <!-- |
| This file documents the GNU C library. |
| |
| This is Edition 0.12, last updated 2007-10-27, |
| of `The GNU C Library Reference Manual', for version |
| 2.8 (Sourcery G++ Lite 2011.03-41). |
| |
| Copyright (C) 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2001, 2002, |
| 2003, 2007, 2008, 2010 Free Software Foundation, Inc. |
| |
| Permission is granted to copy, distribute and/or modify this document |
| under the terms of the GNU Free Documentation License, Version 1.3 or |
| any later version published by the Free Software Foundation; with the |
| Invariant Sections being ``Free Software Needs Free Documentation'' |
| and ``GNU Lesser General Public License'', the Front-Cover texts being |
| ``A GNU Manual'', and with the Back-Cover Texts as in (a) below. A |
| copy of the license is included in the section entitled "GNU Free |
| Documentation License". |
| |
| (a) The FSF's Back-Cover Text is: ``You have the freedom to |
| copy and modify this GNU manual. Buying copies from the FSF |
| supports it in developing GNU and promoting software freedom.''--> |
| <meta http-equiv="Content-Style-Type" content="text/css"> |
| <style type="text/css"><!-- |
| pre.display { font-family:inherit } |
| pre.format { font-family:inherit } |
| pre.smalldisplay { font-family:inherit; font-size:smaller } |
| pre.smallformat { font-family:inherit; font-size:smaller } |
| pre.smallexample { font-size:smaller } |
| pre.smalllisp { font-size:smaller } |
| span.sc { font-variant:small-caps } |
| span.roman { font-family:serif; font-weight:normal; } |
| span.sansserif { font-family:sans-serif; font-weight:normal; } |
| --></style> |
| <link rel="stylesheet" type="text/css" href="../cs.css"> |
| </head> |
| <body> |
| <div class="node"> |
| <a name="Converting-Strings"></a> |
| </div> |
| |
| <h4 class="subsection">6.3.4 Converting Multibyte and Wide Character Strings</h4> |
| |
| <p>The functions described in the previous section only convert a single |
| character at a time. Most operations performed in real-world |
| programs involve strings, and therefore the ISO C<!-- /@w --> standard also |
| defines conversions on entire strings. However, the defined set of |
| functions is quite limited; therefore, the GNU C library contains a few |
| extensions that can help in some important situations. |
| |
| <!-- wchar.h --> |
| <!-- ISO --> |
| <div class="defun"> |
| — Function: size_t <b>mbsrtowcs</b> (<var>wchar_t *restrict dst, const char **restrict src, size_t len, mbstate_t *restrict ps</var>)<var><a name="index-mbsrtowcs-655"></a></var><br> |
| <blockquote><p>The <code>mbsrtowcs</code> function (“multibyte string restartable to wide |
| character string”) converts a NUL-terminated multibyte character |
| string at <code>*</code><var>src</var> into an equivalent wide character string, |
| including the NUL wide character at the end. The conversion is started |
| using the state information from the object pointed to by <var>ps</var> or |
| from an internal object of <code>mbsrtowcs</code> if <var>ps</var> is a null |
| pointer. Before returning, the state object is updated to match the state |
| after the last converted character. The state is the initial state if the |
| terminating NUL byte is reached and converted. |
| |
| <p>If <var>dst</var> is not a null pointer, the result is stored in the array |
| pointed to by <var>dst</var>; otherwise, the conversion result is not |
| available since it is stored in an internal buffer. |
| |
| <p>If <var>len</var> wide characters are stored in the array <var>dst</var> before |
| reaching the end of the input string, the conversion stops and <var>len</var> |
| is returned. If <var>dst</var> is a null pointer, <var>len</var> is never checked. |
| |
| <p>Another reason for a premature return from the function call is if the |
| input string contains an invalid multibyte sequence. In this case the |
| global variable <code>errno</code> is set to <code>EILSEQ</code> and the function |
| returns <code>(size_t) -1</code>. |
| |
| <!-- XXX The ISO C9x draft seems to have a problem here. It says that PS --> |
| <!-- is not updated if DST is NULL. This is not said straightforward and --> |
| <!-- none of the other functions is described like this. It would make sense --> |
| <!-- to define the function this way but I don't think it is meant like this. --> |
| <p>In all other cases the function returns the number of wide characters |
| converted during this call, not counting the terminating NUL wide |
| character. If <var>dst</var> is not null, <code>mbsrtowcs</code> |
| stores in the pointer pointed to by <var>src</var> either a null pointer (if |
| the NUL byte in the input string was reached) or the address of the byte |
| following the last converted multibyte character. |
| |
| <p><a name="index-wchar_002eh-656"></a><code>mbsrtowcs</code> was introduced in Amendment 1<!-- /@w --> to ISO C90<!-- /@w --> and is |
| declared in <samp><span class="file">wchar.h</span></samp>. |
| </p></blockquote></div> |
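| |
| <p>As an illustration, the following sketch calls <code>mbsrtowcs</code> twice: |
| once with a null <var>dst</var> to learn how many wide characters are needed, |
| and once to perform the conversion into a freshly allocated array. The |
| helper name <code>mbs_to_wcs_dup</code> is hypothetical and not part of any |
| library. The sketch relies on the ISO C rule that the return value does |
| not count the terminating NUL wide character. |

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a library function): convert the
   NUL-terminated multibyte string S into a newly allocated wide
   character string.  Returns NULL on an invalid multibyte sequence
   or on allocation failure.  The caller frees the result.  */
wchar_t *
mbs_to_wcs_dup (const char *s)
{
  mbstate_t state;
  const char *src = s;
  wchar_t *result;
  size_t n;

  /* First pass: DST is a null pointer, so LEN is ignored and only
     the number of wide characters is computed.  */
  memset (&state, '\0', sizeof (state));
  n = mbsrtowcs (NULL, &src, 0, &state);
  if (n == (size_t) -1)
    return NULL;

  /* Reserve one extra element for the terminating NUL wide character.  */
  result = malloc ((n + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second pass: restart from the beginning in the initial state.  */
  src = s;
  memset (&state, '\0', sizeof (state));
  if (mbsrtowcs (result, &src, n + 1, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

| <p>Because <var>len</var> is one larger than the count returned by the first |
| call, the second call also stores the terminating NUL wide character. |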
| |
| <p>The definition of the <code>mbsrtowcs</code> function has one important |
| limitation. The requirement that the input at <code>*</code><var>src</var> be a |
| NUL-terminated string causes problems if one wants to convert buffers |
| of text. A buffer is normally not a collection of NUL-terminated strings |
| but a continuous sequence of lines separated by newline characters. Now |
| assume that a function to convert one line from a buffer is needed. Since |
| the line is not NUL-terminated, the source pointer cannot point directly |
| into the unmodified text buffer. This means that one either inserts a NUL |
| byte at the appropriate place for the duration of the <code>mbsrtowcs</code> |
| call (which is not possible for a read-only buffer or in a |
| multi-threaded application) or copies the line into an extra buffer |
| where it can be terminated by a NUL byte. Note that it is not in general |
| possible to limit the number of characters to convert by setting the |
| parameter <var>len</var> to a specific value: <var>len</var> counts wide |
| characters stored in the output, and since it is not known how many bytes |
| each multibyte character sequence occupies, one can only guess. |
| |
| <p><a name="index-stateful-657"></a>There is still a problem with the method of NUL-terminating a line right |
| after the newline character, which could lead to very strange results. |
| As said in the description of the <code>mbsrtowcs</code> function above, the |
| conversion state is guaranteed to be the initial shift state after |
| processing the NUL byte at the end of the input string. But this NUL |
| byte is not really part of the text: the conversion state after |
| the newline in the original text could be something other than the |
| initial shift state, in which case the first character of the next line |
| is encoded using that state. That state is never accessible to the user, |
| however, since the conversion stops after the NUL byte |
| (which resets the state). Most stateful character sets in use today |
| require that the shift state after a newline be the initial state, but |
| this is not a strict guarantee. Simply NUL-terminating a |
| piece of running text is therefore not always an adequate solution and |
| should never be used in general-purpose code. |
| |
| <p>The generic conversion interface (see <a href="Generic-Charset-Conversion.html#Generic-Charset-Conversion">Generic Charset Conversion</a>) |
| does not have this limitation (it simply works on buffers, not |
| strings), and the GNU C library contains a set of functions that take |
| an additional parameter specifying the maximum number of bytes that are |
| consumed from the input string. With these functions, the line conversion |
| problem described above for <code>mbsrtowcs</code> can be solved by determining the line |
| length and passing this length to the function. |
| |
| <!-- wchar.h --> |
| <!-- ISO --> |
| <div class="defun"> |
| — Function: size_t <b>wcsrtombs</b> (<var>char *restrict dst, const wchar_t **restrict src, size_t len, mbstate_t *restrict ps</var>)<var><a name="index-wcsrtombs-658"></a></var><br> |
| <blockquote><p>The <code>wcsrtombs</code> function (“wide character string restartable to |
| multibyte string”) converts the NUL-terminated wide character string at |
| <code>*</code><var>src</var> into an equivalent multibyte character string and |
| stores the result in the array pointed to by <var>dst</var>. The NUL wide |
| character is also converted. The conversion starts in the state |
| described in the object pointed to by <var>ps</var> or in a state object |
| local to <code>wcsrtombs</code> in case <var>ps</var> is a null pointer. If |
| <var>dst</var> is a null pointer, the conversion is performed as usual but the |
| result is not available. If all characters of the input string were |
| successfully converted and if <var>dst</var> is not a null pointer, the |
| pointer pointed to by <var>src</var> gets assigned a null pointer. |
| |
| <p>If one of the wide characters in the input string has no valid multibyte |
| character equivalent, the conversion stops early, sets the global |
| variable <code>errno</code> to <code>EILSEQ</code>, and returns <code>(size_t) -1</code>. |
| |
| <p>Another reason for a premature stop is if <var>dst</var> is not a null |
| pointer and storing the next converted character would bring the total |
| to more than <var>len</var> bytes in the array <var>dst</var>. In this case |
| the pointer pointed to by <var>src</var> is |
| assigned a value pointing to the wide character right after the last one |
| successfully converted. |
| |
| <p>Except in the case of an encoding error, the return value of the |
| <code>wcsrtombs</code> function is the number of bytes in all the multibyte |
| character sequences stored in <var>dst</var>. Before returning, the state in |
| the object pointed to by <var>ps</var> (or the internal object in case |
| <var>ps</var> is a null pointer) is updated to reflect the state after the |
| last conversion. The state is the initial shift state in case the |
| terminating NUL wide character was converted. |
| |
| <p><a name="index-wchar_002eh-659"></a>The <code>wcsrtombs</code> function was introduced in Amendment 1<!-- /@w --> to |
| ISO C90<!-- /@w --> and is declared in <samp><span class="file">wchar.h</span></samp>. |
| </p></blockquote></div> |
| |
| <p>The restriction mentioned above for the <code>mbsrtowcs</code> function applies |
| here also. There is no possibility of directly controlling the number of |
| input characters. One has to place the NUL wide character at the correct |
| place or control the consumed input indirectly via the available output |
| array size (the <var>len</var> parameter). |
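| |
| <p>Controlling the consumed input through the output array size could be |
| sketched as follows. The helper name <code>put_wcs</code> and the buffer size |
| of 64 bytes are assumptions made for this illustration. The size is |
| assumed to be larger than the longest multibyte character of the |
| locale, so that every call makes progress. |

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a library function): write the
   NUL-terminated wide character string WS to FP as multibyte text,
   converting it piece by piece through a small fixed buffer.
   Returns 0 on success and -1 on a conversion or write error.  */
int
put_wcs (const wchar_t *ws, FILE *fp)
{
  mbstate_t state;
  const wchar_t *src = ws;
  char buf[64];   /* Assumed to exceed the longest multibyte character.  */

  memset (&state, '\0', sizeof (state));

  /* wcsrtombs sets SRC to a null pointer once the terminating NUL
     wide character has been converted.  */
  while (src != NULL)
    {
      size_t n = wcsrtombs (buf, &src, sizeof (buf), &state);

      if (n == (size_t) -1)
        return -1;              /* No multibyte equivalent (EILSEQ).  */
      if (n == 0 && src != NULL)
        return -1;              /* Buffer too small to make progress.  */
      /* N does not count a terminating NUL byte, so none is written.  */
      if (fwrite (buf, 1, n, fp) != n)
        return -1;
    }
  return 0;
}
```

| <p>Each call resumes where the previous one stopped because both |
| <var>src</var> and the conversion state are carried over between iterations. |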
| |
| <!-- wchar.h --> |
| <!-- GNU --> |
| <div class="defun"> |
| — Function: size_t <b>mbsnrtowcs</b> (<var>wchar_t *restrict dst, const char **restrict src, size_t nmc, size_t len, mbstate_t *restrict ps</var>)<var><a name="index-mbsnrtowcs-660"></a></var><br> |
| <blockquote><p>The <code>mbsnrtowcs</code> function is very similar to the <code>mbsrtowcs</code> |
| function. All the parameters are the same except for <var>nmc</var>, which is |
| new. The return value is the same as for <code>mbsrtowcs</code>. |
| |
| <p>This new parameter specifies how many bytes at most can be used from the |
| multibyte character string. In other words, the multibyte character |
| string <code>*</code><var>src</var> need not be NUL-terminated. But if a NUL byte |
| is found within the first <var>nmc</var> bytes of the string, the conversion |
| stops there. |
| |
| <p>This function is a GNU extension. It is meant to work around the |
| problems mentioned above. Now it is possible to convert a buffer with |
| multibyte character text piece for piece without having to care about |
| inserting NUL bytes and the effect of NUL bytes on the conversion state. |
| </p></blockquote></div> |
| |
| <p>A function to convert a multibyte string into a wide character string |
| and display it could be written like this (this is not a really useful |
| example): |
| |
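| <pre class="smallexample">     void |
|      showmbs (const char *src, FILE *fp) |
|      { |
|        mbstate_t state; |
|        int cnt = 0; |
|        memset (&state, '\0', sizeof (state)); |
|        while (1) |
|          { |
|            wchar_t linebuf[100]; |
|            const char *endp = strchr (src, '\n'); |
|            size_t n; |
| |
|            /* <span class="roman">Exit if there is no more line.</span>  */ |
|            if (endp == NULL) |
|              break; |
| |
|            n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state); |
|            /* <span class="roman">Stop on an invalid multibyte sequence.</span>  */ |
|            if (n == (size_t) -1) |
|              break; |
|            linebuf[n] = L'\0'; |
|            fprintf (fp, "line %d: \"%ls\"\n", ++cnt, linebuf); |
| |
|            /* <span class="roman">Skip the newline once the whole line is converted;</span> |
|               <span class="roman">a longer line continues in the next iteration.</span>  */ |
|            if (src == endp) |
|              ++src; |
|          } |
|      } |
| </pre> |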
| <p>There is no problem with the state after a call to <code>mbsnrtowcs</code>. |
| Since no characters are inserted into the string that were not there |
| from the beginning, and <var>state</var> is used only for the conversion |
| of the given buffer, altering the state causes no harm. |
| |
| <!-- wchar.h --> |
| <!-- GNU --> |
| <div class="defun"> |
| — Function: size_t <b>wcsnrtombs</b> (<var>char *restrict dst, const wchar_t **restrict src, size_t nwc, size_t len, mbstate_t *restrict ps</var>)<var><a name="index-wcsnrtombs-661"></a></var><br> |
| <blockquote><p>The <code>wcsnrtombs</code> function implements the conversion from wide |
| character strings to multibyte character strings. It is similar to |
| <code>wcsrtombs</code> but, just like <code>mbsnrtowcs</code>, it takes an extra |
| parameter, which specifies the length of the input string. |
| |
| <p>No more than <var>nwc</var> wide characters from the input string |
| <code>*</code><var>src</var> are converted. If the input string contains a NUL |
| wide character in the first <var>nwc</var> characters, the conversion stops at |
| this place. |
| |
| <p>The <code>wcsnrtombs</code> function is a GNU extension and just like |
| <code>mbsnrtowcs</code> helps in situations where no NUL-terminated input |
| strings are available. |
| </p></blockquote></div> |
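| |
| <p>A use of <code>wcsnrtombs</code> on input that is not NUL-terminated might |
| look like the following sketch. The helper name <code>wcsn_to_mbs</code> and |
| its truncating behavior are choices made for this illustration only. |
| <code>_GNU_SOURCE</code> is defined so that the GNU C library declares the |
| function. |

```c
#define _GNU_SOURCE             /* Ask glibc to declare wcsnrtombs.  */
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a library function): convert at most NWC
   wide characters from WS, which need not be NUL-terminated, into
   the array DST of DSTSIZE bytes and NUL-terminate the result.
   Output that does not fit is silently truncated.  Returns the
   number of bytes stored, not counting the NUL byte, or (size_t) -1
   on an encoding error or if DSTSIZE is zero.  */
size_t
wcsn_to_mbs (char *dst, size_t dstsize, const wchar_t *ws, size_t nwc)
{
  mbstate_t state;
  const wchar_t *src = ws;
  size_t n;

  if (dstsize == 0)
    return (size_t) -1;

  memset (&state, '\0', sizeof (state));

  /* Leave room for the NUL byte that is appended below.  */
  n = wcsnrtombs (dst, &src, nwc, dstsize - 1, &state);
  if (n == (size_t) -1)
    return (size_t) -1;

  dst[n] = '\0';
  return n;
}
```

| <p>Passing <var>nwc</var> directly avoids having to store a NUL wide |
| character after the part of the array that should be converted. |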
| |
| </body></html> |
| |