<html lang="en">
<head>
<title>Converting Strings - The GNU C Library</title>
<meta http-equiv="Content-Type" content="text/html">
<meta name="description" content="The GNU C Library">
<meta name="generator" content="makeinfo 4.13">
<link title="Top" rel="start" href="index.html#Top">
<link rel="up" href="Restartable-multibyte-conversion.html#Restartable-multibyte-conversion" title="Restartable multibyte conversion">
<link rel="prev" href="Converting-a-Character.html#Converting-a-Character" title="Converting a Character">
<link rel="next" href="Multibyte-Conversion-Example.html#Multibyte-Conversion-Example" title="Multibyte Conversion Example">
<link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage">
<!--
This file documents the GNU C library.
This is Edition 0.12, last updated 2007-10-27,
of `The GNU C Library Reference Manual', for version
2.8 (Sourcery G++ Lite 2011.03-41).
Copyright (C) 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2001, 2002,
2003, 2007, 2008, 2010 Free Software Foundation, Inc.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with the
Invariant Sections being ``Free Software Needs Free Documentation''
and ``GNU Lesser General Public License'', the Front-Cover texts being
``A GNU Manual'', and with the Back-Cover Texts as in (a) below. A
copy of the license is included in the section entitled "GNU Free
Documentation License".
(a) The FSF's Back-Cover Text is: ``You have the freedom to
copy and modify this GNU manual. Buying copies from the FSF
supports it in developing GNU and promoting software freedom.''-->
<meta http-equiv="Content-Style-Type" content="text/css">
<style type="text/css"><!--
pre.display { font-family:inherit }
pre.format { font-family:inherit }
pre.smalldisplay { font-family:inherit; font-size:smaller }
pre.smallformat { font-family:inherit; font-size:smaller }
pre.smallexample { font-size:smaller }
pre.smalllisp { font-size:smaller }
span.sc { font-variant:small-caps }
span.roman { font-family:serif; font-weight:normal; }
span.sansserif { font-family:sans-serif; font-weight:normal; }
--></style>
<link rel="stylesheet" type="text/css" href="../cs.css">
</head>
<body>
<div class="node">
<a name="Converting-Strings"></a>
<p>
Next:&nbsp;<a rel="next" accesskey="n" href="Multibyte-Conversion-Example.html#Multibyte-Conversion-Example">Multibyte Conversion Example</a>,
Previous:&nbsp;<a rel="previous" accesskey="p" href="Converting-a-Character.html#Converting-a-Character">Converting a Character</a>,
Up:&nbsp;<a rel="up" accesskey="u" href="Restartable-multibyte-conversion.html#Restartable-multibyte-conversion">Restartable multibyte conversion</a>
<hr>
</div>
<h4 class="subsection">6.3.4 Converting Multibyte and Wide Character Strings</h4>
<p>The functions described in the previous section convert only a single
character at a time. Most real-world programs operate on whole strings,
so the ISO&nbsp;C<!-- /@w --> standard also defines conversions on entire strings.
However, the defined set of functions is quite limited; therefore, the
GNU C library contains a few extensions that can help in some important
situations.
<!-- wchar.h -->
<!-- ISO -->
<div class="defun">
&mdash; Function: size_t <b>mbsrtowcs</b> (<var>wchar_t *restrict dst, const char **restrict src, size_t len, mbstate_t *restrict ps</var>)<var><a name="index-mbsrtowcs-655"></a></var><br>
<blockquote><p>The <code>mbsrtowcs</code> function (&ldquo;multibyte string restartable to wide
character string&rdquo;) converts a NUL-terminated multibyte character
string at <code>*</code><var>src</var> into an equivalent wide character string,
including the NUL wide character at the end. The conversion is started
using the state information from the object pointed to by <var>ps</var> or
from an internal object of <code>mbsrtowcs</code> if <var>ps</var> is a null
pointer. Before returning, the state object is updated to match the state
after the last converted character. The state is the initial state if the
terminating NUL byte is reached and converted.
<p>If <var>dst</var> is not a null pointer, the result is stored in the array
pointed to by <var>dst</var>; otherwise, the conversion result is not
available since it is stored in an internal buffer.
<p>If <var>len</var> wide characters are stored in the array <var>dst</var> before
reaching the end of the input string, the conversion stops and <var>len</var>
is returned. If <var>dst</var> is a null pointer, <var>len</var> is never checked.
<p>Another reason for a premature return from the function call is if the
input string contains an invalid multibyte sequence. In this case the
global variable <code>errno</code> is set to <code>EILSEQ</code> and the function
returns <code>(size_t) -1</code>.
<!-- XXX The ISO C9x draft seems to have a problem here. It says that PS -->
<!-- is not updated if DST is NULL. This is not said straightforward and -->
<!-- none of the other functions is described like this. It would make sense -->
<!-- to define the function this way but I don't think it is meant like this. -->
<p>In all other cases the function returns the number of wide characters
converted during this call, not counting the terminating NUL wide
character. If <var>dst</var> is not null, <code>mbsrtowcs</code>
stores in the pointer pointed to by <var>src</var> either a null pointer (if
the NUL byte in the input string was reached) or the address of the byte
following the last converted multibyte character.
<p><a name="index-wchar_002eh-656"></a><code>mbsrtowcs</code> was introduced in Amendment&nbsp;1<!-- /@w --> to ISO&nbsp;C90<!-- /@w --> and is
declared in <samp><span class="file">wchar.h</span></samp>.
</p></blockquote></div>
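<p>The behavior with a null <var>dst</var> lends itself to a two-pass
conversion: count first, then allocate and convert. The following is a
minimal sketch of that idea, not part of the library; the helper name
<code>mbs_to_wcs_dup</code> is hypothetical, and the sketch assumes the
string starts in the initial shift state.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a library function): convert the
   NUL-terminated multibyte string MBS into a newly allocated wide
   character string.  Returns NULL on an invalid multibyte sequence
   or if memory cannot be allocated.  The caller frees the result.  */
wchar_t *
mbs_to_wcs_dup (const char *mbs)
{
  mbstate_t state;
  const char *src = mbs;
  wchar_t *result;
  size_t n;

  assert (mbs != NULL);

  /* First pass: with a null dst nothing is stored and len is not
     checked, so the return value is the required length without
     the terminating NUL wide character.  */
  memset (&state, '\0', sizeof (state));
  n = mbsrtowcs (NULL, &src, 0, &state);
  if (n == (size_t) -1)
    return NULL;

  result = malloc ((n + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second pass: start again from the initial state and leave
     room for the terminating NUL wide character.  */
  memset (&state, '\0', sizeof (state));
  src = mbs;
  n = mbsrtowcs (result, &src, n + 1, &state);
  if (n == (size_t) -1)
    {
      free (result);
      return NULL;
    }

  /* Because the NUL byte was reached, src is now a null pointer.  */
  return result;
}
```

<p>Resetting the state object before each pass keeps the two passes
independent of each other; the result depends on the current locale's
<code>LC_CTYPE</code> category.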
<p>The definition of the <code>mbsrtowcs</code> function has one important
limitation. The requirement that the input at <code>*</code><var>src</var> be a
NUL-terminated string causes problems if one wants to convert buffers
with text. A buffer is normally not a collection of NUL-terminated
strings but instead a continuous collection of lines, separated by
newline characters. Now assume that a function to convert one line from
a buffer is needed. Since the line is not NUL-terminated, the source
pointer cannot directly point into the unmodified text buffer. This
means that either one inserts the NUL byte at the appropriate place for
the duration of the <code>mbsrtowcs</code> function call (which is not
doable for a read-only buffer or in a multi-threaded application) or one
copies the line into an extra buffer where it can be terminated by a NUL
byte. Note that it is not in general possible to limit the number of
characters to convert by setting the parameter <var>len</var> to any
specific value. Since it is not known how many bytes each multibyte
character sequence occupies, one can only guess.
<p><a name="index-stateful-657"></a>There is still a problem with the method of NUL-terminating a line right
after the newline character, which could lead to very strange results.
As said in the description of the <code>mbsrtowcs</code> function above, the
conversion state is guaranteed to be the initial shift state after
processing the NUL byte at the end of the input string. But this NUL
byte is not really part of the text: the conversion state after
the newline in the original text could be something other than the
initial shift state, and the first character of the next line
would then be encoded using that state. That state is never
accessible to the user, since the conversion stops after the NUL byte
(which resets the state). Most stateful character sets in use today
require that the shift state after a newline be the initial state, but
this is not a strict guarantee. Simply NUL-terminating a
piece of running text is therefore not always an adequate solution and
should never be used in code intended for general use.
<p>The generic conversion interface (see <a href="Generic-Charset-Conversion.html#Generic-Charset-Conversion">Generic Charset Conversion</a>)
does not have this limitation (it simply works on buffers, not
strings), and the GNU C library contains a set of functions that take
an additional parameter specifying the maximal number of bytes that are
consumed from the input string. This way the line-conversion problem
described above for <code>mbsrtowcs</code> can be solved by determining the line
length and passing this length to the function.
<!-- wchar.h -->
<!-- ISO -->
<div class="defun">
&mdash; Function: size_t <b>wcsrtombs</b> (<var>char *restrict dst, const wchar_t **restrict src, size_t len, mbstate_t *restrict ps</var>)<var><a name="index-wcsrtombs-658"></a></var><br>
<blockquote><p>The <code>wcsrtombs</code> function (&ldquo;wide character string restartable to
multibyte string&rdquo;) converts the NUL-terminated wide character string at
<code>*</code><var>src</var> into an equivalent multibyte character string and
stores the result in the array pointed to by <var>dst</var>. The NUL wide
character is also converted. The conversion starts in the state
described in the object pointed to by <var>ps</var> or by a state object
local to <code>wcsrtombs</code> in case <var>ps</var> is a null pointer. If
<var>dst</var> is a null pointer, the conversion is performed as usual but the
result is not available. If all characters of the input string were
successfully converted and if <var>dst</var> is not a null pointer, the
pointer pointed to by <var>src</var> gets assigned a null pointer.
<p>If one of the wide characters in the input string has no valid multibyte
character equivalent, the conversion stops early, sets the global
variable <code>errno</code> to <code>EILSEQ</code>, and returns <code>(size_t) -1</code>.
<p>Another reason for a premature stop is if <var>dst</var> is not a null
pointer and storing the next converted character would bring the total
to more than <var>len</var> bytes in the array <var>dst</var>. In this case
the pointer pointed to by <var>src</var> is
assigned a value pointing to the wide character right after the last one
successfully converted.
<p>Except in the case of an encoding error, the return value of the
<code>wcsrtombs</code> function is the number of bytes in all the multibyte
character sequences stored in <var>dst</var>, not counting the terminating
NUL byte. Before returning, the state in
the object pointed to by <var>ps</var> (or the internal object in case
<var>ps</var> is a null pointer) is updated to reflect the state after the
last conversion. The state is the initial shift state in case the
terminating NUL wide character was converted.
<p><a name="index-wchar_002eh-659"></a>The <code>wcsrtombs</code> function was introduced in Amendment&nbsp;1<!-- /@w --> to
ISO&nbsp;C90<!-- /@w --> and is declared in <samp><span class="file">wchar.h</span></samp>.
</p></blockquote></div>
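<p>The interplay of <var>len</var>, the return value, and the updated
<code>*</code><var>src</var> can be illustrated with a small wrapper that
fills a fixed-size buffer. This is only a sketch under stated
assumptions: the name <code>wcs_to_buf</code> and its return codes are
hypothetical, and the conversion is assumed to start in the initial
shift state.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a library function): convert the
   NUL-terminated wide character string WCS into BUF, which holds
   CAP bytes, and always NUL-terminate the result.
   Returns 0 if all characters were converted, 1 if the output was
   cut short because BUF is too small, and -1 if a wide character
   has no multibyte equivalent (errno is then EILSEQ).  */
int
wcs_to_buf (char *buf, size_t cap, const wchar_t *wcs)
{
  mbstate_t state;
  const wchar_t *src = wcs;
  size_t n;

  assert (buf != NULL && cap > 0 && wcs != NULL);
  memset (&state, '\0', sizeof (state));

  /* Reserve one byte so that the result can always be terminated.  */
  n = wcsrtombs (buf, &src, cap - 1, &state);
  if (n == (size_t) -1)
    {
      buf[0] = '\0';
      return -1;
    }
  buf[n] = '\0';

  /* src is a null pointer if the NUL wide character was converted.
     Otherwise it points at the first unconverted wide character;
     if that is the terminator itself, no text was lost.  (In a
     stateful encoding the final shift sequence may be missing in
     that case; this sketch ignores that.)  */
  return (src == NULL || *src == L'\0') ? 0 : 1;
}
```

<p>Because the function stops before a character that would not fit
completely, a truncated result never ends in the middle of a multibyte
sequence.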
<p>The restriction mentioned above for the <code>mbsrtowcs</code> function applies
here also. There is no possibility of directly controlling the number of
input characters. One has to place the NUL wide character at the correct
place or control the consumed input indirectly via the available output
array size (the <var>len</var> parameter).
<!-- wchar.h -->
<!-- GNU -->
<div class="defun">
&mdash; Function: size_t <b>mbsnrtowcs</b> (<var>wchar_t *restrict dst, const char **restrict src, size_t nmc, size_t len, mbstate_t *restrict ps</var>)<var><a name="index-mbsnrtowcs-660"></a></var><br>
<blockquote><p>The <code>mbsnrtowcs</code> function is very similar to the <code>mbsrtowcs</code>
function. All the parameters are the same except for <var>nmc</var>, which is
new. The return value is the same as for <code>mbsrtowcs</code>.
<p>This new parameter specifies how many bytes at most can be used from the
multibyte character string. In other words, the multibyte character
string <code>*</code><var>src</var> need not be NUL-terminated. But if a NUL byte
is found within the first <var>nmc</var> bytes of the string, the conversion
stops there.
<p>This function is a GNU extension. It is meant to work around the
problems mentioned above. Now it is possible to convert a buffer with
multibyte character text piece for piece without having to care about
inserting NUL bytes and the effect of NUL bytes on the conversion state.
</p></blockquote></div>
<p>A function to convert a multibyte string into a wide character string
and display it could be written like this (this is not a really useful
example):
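<pre class="smallexample">     #include &lt;stdio.h&gt;
     #include &lt;string.h&gt;
     #include &lt;wchar.h&gt;
     
     void
     showmbs (const char *src, FILE *fp)
     {
       mbstate_t state;
       int cnt = 0;
       memset (&amp;state, '\0', sizeof (state));
       while (1)
         {
           wchar_t linebuf[100];
           const char *endp = strchr (src, '\n');
           size_t n;
           /* <span class="roman">Exit if there is no more line.</span>  */
           if (endp == NULL)
             break;
           n = mbsnrtowcs (linebuf, &amp;src, endp - src, 99, &amp;state);
           /* <span class="roman">Stop at an invalid multibyte sequence.</span>  */
           if (n == (size_t) -1)
             break;
           linebuf[n] = L'\0';
           fprintf (fp, "line %d: \"%ls\"\n", ++cnt, linebuf);
           /* <span class="roman">Skip the newline; the rest of an overlong line is dropped.</span>  */
           src = endp + 1;
         }
     }
</pre>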
<p>The conversion state causes no trouble after a call to <code>mbsnrtowcs</code>.
No characters are inserted into the string that were not there
from the beginning, and <var>state</var> is used only for the conversion
of the given buffer, so letting the function alter it is harmless.
<!-- wchar.h -->
<!-- GNU -->
<div class="defun">
&mdash; Function: size_t <b>wcsnrtombs</b> (<var>char *restrict dst, const wchar_t **restrict src, size_t nwc, size_t len, mbstate_t *restrict ps</var>)<var><a name="index-wcsnrtombs-661"></a></var><br>
<blockquote><p>The <code>wcsnrtombs</code> function implements the conversion from wide
character strings to multibyte character strings. It is similar to
<code>wcsrtombs</code> but, just like <code>mbsnrtowcs</code>, it takes an extra
parameter, which specifies the length of the input string.
<p>No more than <var>nwc</var> wide characters from the input string
<code>*</code><var>src</var> are converted. If the input string contains a NUL
wide character in the first <var>nwc</var> characters, the conversion stops at
this place.
<p>The <code>wcsnrtombs</code> function is a GNU extension and just like
<code>mbsnrtowcs</code> helps in situations where no NUL-terminated input
strings are available.
</p></blockquote></div>
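<p>As an illustration of the extra length parameter, the following sketch
converts only a prefix of a wide character buffer that need not be
NUL-terminated. The helper name <code>wcs_prefix_to_buf</code> is
hypothetical, and defining <code>_GNU_SOURCE</code> is an assumption made
here so that the declaration is visible regardless of the selected
standard.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a library function): convert at most
   NWC wide characters starting at WCS, which need not be
   NUL-terminated, into BUF of CAP bytes and NUL-terminate the
   result.  Returns the number of bytes stored, not counting the
   terminating NUL byte, or (size_t) -1 on an encoding error.  */
size_t
wcs_prefix_to_buf (char *buf, size_t cap, const wchar_t *wcs, size_t nwc)
{
  mbstate_t state;
  const wchar_t *src = wcs;
  size_t n;

  assert (buf != NULL && cap > 0 && wcs != NULL);
  memset (&state, '\0', sizeof (state));

  /* nwc limits the input consumed; cap - 1 limits the output
     and keeps one byte free for the terminator.  */
  n = wcsnrtombs (buf, &src, nwc, cap - 1, &state);
  if (n == (size_t) -1)
    {
      buf[0] = '\0';
      return n;
    }
  buf[n] = '\0';
  return n;
}
```

<p>A caller could, for example, pass the length of one field of a larger
wide character record as <var>nwc</var> without copying or modifying the
record.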
</body></html>