 <?xml version="1.0" encoding='UTF-8'?>
 <!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook V4.5//EN"
 		"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">

 <sect1 id="setup-locale"><title>Internationalization</title>

 <sect2 id="setup-locale-ov"><title>Overview</title>

 <para>
 Internationalization support is controlled by the <envar>LANG</envar> and
 <envar>LC_xxx</envar> environment variables.  You can set all of them
 but Cygwin itself only honors the variables <envar>LC_ALL</envar>,
 <envar>LC_CTYPE</envar>, and <envar>LANG</envar>, in this order, according
 to the POSIX standard.  The content of these variables should follow the
 POSIX standard for a locale specifier.  The correct form of a locale
 specifier is</para>

 <screen>
   language[[_TERRITORY][.charset][@modifier]]
 </screen>

 <para>"language" is a lowercase two character string per ISO 639-1, or,
 if there is no ISO 639-1 code for the language (for instance, "Lower Sorbian"),
 a three character string per ISO 639-3.</para>

 <para>"TERRITORY" is an uppercase two character string per ISO 3166, and
 "charset" is one of a list of supported character sets.  The modifier doesn't
 matter here (though some are recognized, see below).  If you're interested in the
 exact description, you can find it in the online publication of the POSIX
 manual pages on the homepage of the
 <ulink url="http://www.opengroup.org/">Open Group</ulink>.</para>

 <para>Typical locale specifiers are</para>

 <screen>
   "de_CH"          language = German, territory = Switzerland, default charset
   "fr_FR.UTF-8"    language = French, territory = France, charset = UTF-8
   "ko_KR.eucKR"    language = Korean, territory = South Korea, charset = eucKR
   "syr_SY"         language = Syriac, territory = Syria, default charset
 </screen>
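
 <para>As a rough illustration of the form given above, the following Python
 sketch splits a locale specifier into its parts.  The names
 <literal>LOCALE_RE</literal> and <literal>parse_locale</literal> are made up
 for this example and are not part of Cygwin.  The pattern is a simplification:
 it does not handle the special "C" and "POSIX" names, and it does not check
 whether the language, territory, or charset is actually supported.</para>

 <screen><![CDATA[
```python
import re

# Simplified pattern for language[[_TERRITORY][.charset][@modifier]].
# Assumption: 2-3 lowercase letters for the language, 2 uppercase letters
# for the territory; real implementations may accept more than this.
LOCALE_RE = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:_(?P<territory>[A-Z]{2}))?"
    r"(?:\.(?P<charset>[A-Za-z0-9-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?$"
)


def parse_locale(spec):
    """Return the parts of a locale specifier as a dict, or None if malformed."""
    match = LOCALE_RE.match(spec)
    if match is None:
        return None
    return match.groupdict()


for spec in ("de_CH", "fr_FR.UTF-8", "ko_KR.eucKR", "syr_SY"):
    print(spec, parse_locale(spec))
```
]]></screen>

 <para>Parts that are absent from the specifier come back as
 <literal>None</literal>, which corresponds to "default charset" in the
 examples above.</para>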

 <para>
 If the locale specifier does not follow the above form, Cygwin checks
 if the locale is one of the locale aliases defined in the file
 <filename>/usr/share/locale/locale.alias</filename>.  If so, and if
 the replacement localename is supported by the underlying Windows,
 the locale is accepted, too.  So, given the default content of the
 <filename>/usr/share/locale/locale.alias</filename> file, the below
 examples would be valid locale specifiers as well.
 </para>

 <screen>
   "catalan"        defined as "ca_ES.ISO-8859-1" in locale.alias
   "japanese"       defined as "ja_JP.eucJP"      in locale.alias
   "turkish"        defined as "tr_TR.ISO-8859-9" in locale.alias
 </screen>

 <para>The file <filename>/usr/share/locale/locale.alias</filename> is
 provided by the gettext package under Cygwin.</para>
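
 <para>The alias lookup can be pictured with the following Python sketch.  It
 assumes the usual <filename>locale.alias</filename> layout of one alias and one
 replacement per line, separated by whitespace, with <literal>#</literal>
 starting a comment.  The sample text and the function names are hypothetical
 and only reuse the three aliases shown above; the check whether Windows
 supports the replacement locale is not modelled.</para>

 <screen><![CDATA[
```python
# Hypothetical excerpt in the assumed locale.alias format.
SAMPLE_ALIAS_TEXT = """\
# alias         replacement
catalan         ca_ES.ISO-8859-1
japanese        ja_JP.eucJP
turkish         tr_TR.ISO-8859-9
"""


def parse_aliases(text):
    """Build an alias -> locale mapping from locale.alias style text."""
    aliases = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        parts = line.split()
        if len(parts) == 2:
            aliases[parts[0]] = parts[1]
    return aliases


def resolve_alias(spec, aliases):
    """Return the replacement for an alias, or the specifier unchanged."""
    return aliases.get(spec, spec)


aliases = parse_aliases(SAMPLE_ALIAS_TEXT)
print(resolve_alias("japanese", aliases))
```
]]></screen>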

 <para>
 At application startup, the application's locale is set to the default
 "C" or "POSIX" locale.  This locale defaults to the ASCII character set
 on the application level.  If you want to stick to the "C" locale and only
 change to another charset, you can define this by setting one of the locale
 environment variables to "C.charset".  For instance</para>

 <screen>
   "C.ISO-8859-1"
 </screen>

 <note><para>The default locale in the absence of the aforementioned locale
 environment variables is "C.UTF-8".</para></note>
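
 <para>The lookup order described above (<envar>LC_ALL</envar>, then
 <envar>LC_CTYPE</envar>, then <envar>LANG</envar>, with "C.UTF-8" as the
 fallback) can be sketched in Python as follows.  The function name
 <literal>effective_locale</literal> is invented for this example.  Treating
 an empty value like an unset variable is an assumption based on the POSIX
 rules for these variables, not something stated in this section.</para>

 <screen><![CDATA[
```python
import os

DEFAULT_LOCALE = "C.UTF-8"  # default named in the text when nothing is set


def effective_locale(environ=None):
    """Return the first non-empty value of LC_ALL, LC_CTYPE, LANG."""
    if environ is None:
        environ = os.environ
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:  # assumption: empty string counts as unset
            return value
    return DEFAULT_LOCALE


print(effective_locale({"LANG": "ja_JP.UTF-8", "LC_CTYPE": "de_CH"}))
```
]]></screen>

 <para>Passing a plain dictionary instead of the real environment makes the
 precedence easy to try out without changing any settings.</para>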

 <para>Windows uses the UTF-16 charset exclusively to store the names
 of any object used by the Operating System.  This is especially important
 with filenames.  Cygwin uses the setting of the locale environment variables
 <envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>, and <envar>LANG</envar>, to
 determine how to convert Windows filenames from their UTF-16 representation
 to the singlebyte or multibyte character set used by Cygwin.</para>

 <para>
 The setting of the locale environment variables at process startup
 is effective for Cygwin's internal conversions to and from the Windows UTF-16
 object names for the entire lifetime of the current process.  Changing
 the environment variables to another value changes the way filenames are
 converted in subsequently started child processes, but not within the same
 process.</para>

 <para>
 However, even if one of the locale environment variables is set to
 some value other than "C", this <emphasis>only</emphasis> affects
 how Cygwin itself converts filenames.  As the POSIX standard requires,
 it's the application's responsibility to activate that locale for its
 own purposes, typically by using the call</para>

 <screen>
   setlocale (LC_ALL, "");
 </screen>

 <para>early in the application code.  Again, so that this doesn't get
 lost:  If the application calls <function>setlocale</function> as above, and
 none of the relevant locale variables is set in the environment, the locale
 is set to the default locale, which is "C.UTF-8".</para>

 <para>But what about applications which are not locale-aware?  Per POSIX,
 they are running in the "C" or "POSIX" locale, which implies the ASCII
 charset.  The Cygwin DLL itself, however, will nevertheless use the locale
 set in the environment (or the "C.UTF-8" default locale) for converting
 filenames etc.</para>

 <para>When the locale in the environment specifies an ASCII charset,
 for example "C" or "en_US.ASCII", Cygwin will still use UTF-8
 under the hood to translate filenames.  This allows for easier
 interoperability with applications running in the default "C.UTF-8" locale.
 </para>

 <para>
 The language and territory are used to fetch locale-dependent information
 from Windows.  If the language and territory are not known to Windows, the
 <function>setlocale</function> function fails.</para>

 <para>The following modifiers are recognized.  Any other modifier is simply
 ignored for now.</para>

 <itemizedlist mark="bullet">

 <listitem><para>
 For locales which use the Euro (EUR) as currency, the modifier "@euro"
 can be added to enforce usage of the ISO-8859-15 character set, which
 includes a character for the "Euro" currency sign.
 </para></listitem>

 <listitem><para>
 The default script used for all Serbian language locales (sr_BA, sr_ME, sr_RS,
 and the deprecated sr_CS and sr_SP) is Cyrillic.  With the "@latin" modifier
 it gets switched to the Latin script with the respective collation behaviour.
 </para></listitem>

 <listitem><para>
 The default charset of the "be_BY" locale (Belarusian/Belarus) is CP1251.
 With the "@latin" modifier it's UTF-8.
 </para></listitem>

 <listitem><para>
 The default charset of the "tt_RU" locale (Tatar/Russia) is ISO-8859-5.
 With the "@iqtelif" modifier it's UTF-8.
 </para></listitem>

 <listitem><para>
 The default charset of the "uz_UZ" locale (Uzbek/Uzbekistan) is ISO-8859-1.
 With the "@cyrillic" modifier it's UTF-8.
 </para></listitem>

 <listitem><para>
 There's a class of characters in the Unicode character set, called the
 "CJK Ambiguous Width" characters.  For these characters, the width
 returned by the wcwidth/wcswidth functions is usually 1.  This can be a
 problem with East-Asian languages, which historically use character sets
 where these characters have a width of 2.  Therefore, wcwidth/wcswidth
 return 2 as the width of these characters when an East-Asian charset such
 as GBK or SJIS is selected, or when UTF-8 is selected and the language is
 specified as "zh" (Chinese), "ja" (Japanese), or "ko" (Korean).  This is
 not correct in all circumstances, hence the locale modifier "@cjknarrow"
 can be used to force wcwidth/wcswidth to return 1 for the ambiguous width
 characters.
 </para></listitem>

 </itemizedlist>
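
 <para>The rule for "CJK Ambiguous Width" characters in the last item can be
 summarized with a small Python sketch.  It is not Cygwin's implementation:
 the function names are hypothetical, the set of East-Asian charsets contains
 only the two named above, and Python's <literal>unicodedata</literal> module
 is used merely to show which characters fall into the ambiguous class.</para>

 <screen><![CDATA[
```python
import unicodedata

CJK_LANGUAGES = {"zh", "ja", "ko"}
# Assumption: only the two charsets named in the text; the real set is larger.
EAST_ASIAN_CHARSETS = {"GBK", "SJIS"}


def is_ambiguous(char):
    """True if Unicode classifies the character as East Asian Ambiguous."""
    return unicodedata.east_asian_width(char) == "A"


def ambiguous_width(language, charset, modifier=None):
    """Width this sketch assigns to ambiguous-width characters."""
    if modifier == "cjknarrow":
        return 1
    charset = charset.upper()
    if charset in EAST_ASIAN_CHARSETS:
        return 2
    if charset == "UTF-8" and language in CJK_LANGUAGES:
        return 2
    return 1


print(is_ambiguous("\u03b1"), ambiguous_width("ja", "UTF-8"))
print(ambiguous_width("ja", "UTF-8", "cjknarrow"))
```
]]></screen>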

 </sect2>

 <sect2 id="setup-locale-how"><title>How to set the locale</title>

 <itemizedlist mark="bullet">

 <listitem><para>
 Assume that you've set one of the aforementioned environment variables to some
 valid POSIX locale value, other than "C" and "POSIX".  Assume further that
 you're living in Japan.  You might want to use the language code "ja" and the
 territory "JP", thus setting, say, <envar>LANG</envar> to "ja_JP".  You didn't
 set a character set, so what will Cygwin use now?  The default character set
 is determined by the default Windows ANSI codepage for this language and
 territory.  Cygwin uses a character set which is the typical Unix-equivalent
 to the Windows ANSI codepage.  For instance:</para>

 <screen>
   "en_US"		ISO-8859-1
   "el_GR"		ISO-8859-7
   "pl_PL"		ISO-8859-2
   "pl_PL@euro"		ISO-8859-15
   "ja_JP"		EUCJP
   "ko_KR"		EUCKR
   "te_IN"		UTF-8
 </screen>
 </listitem>

 <listitem><para>
 You don't want to use the default character set?  In that case you have to
 specify the charset explicitly.  For instance, assume you're from Japan and
 don't want to use the Japanese default charset EUC-JP, but the Windows
 default charset SJIS.  What you can do, for instance, is to set the
 <envar>LANG</envar> variable in the <command>mintty</command> Cygwin Terminal
 in the "Text" section of its "Options" dialog.  If you're starting your
 Cygwin session via a batch file or a shortcut to a batch file, you can also
 just set LANG there:</para>

 <screen>
   @echo off

   C:
   chdir C:\cygwin\bin
   set LANG=ja_JP.SJIS
   bash --login -i
 </screen>

 <note><para>For a list of locales supported by your Windows machine, use the
 <command>locale -a</command> command, which is part of the Cygwin package.
 For a description, see <xref linkend="locale"></xref>.</para></note>

 <note><para>For a list of supported character sets, see
 <xref linkend="setup-locale-charsetlist"></xref>
 </para></note>
 </listitem>

 <listitem><para>
 Last, but not least, most singlebyte or doublebyte charsets have a big
 disadvantage.  Windows filesystems use the Unicode character set in the
 UTF-16 encoding to store filename information.  Not all characters
 from the Unicode character set are available in a singlebyte or doublebyte
 charset.  While Cygwin has a workaround to access files with unusual
 characters (see <xref linkend="pathnames-unusual"></xref>), a better
 workaround is to always use the UTF-8 character set.</para>

 <para><emphasis>UTF-8 is the only multibyte character set supported by Cygwin
 which can represent every Unicode character.</emphasis></para>

 <screen>
   set LANG=es_MX.UTF-8
 </screen>

 <para>For a description of the Unicode standard, see the homepage of the
 <ulink url="http://www.unicode.org/">Unicode Consortium</ulink>.
 </para></listitem>

 </itemizedlist>
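
 <para>The limitation of singlebyte and doublebyte charsets mentioned in the
 last item can be demonstrated with Python's own codecs.  These codecs are only
 stand-ins for the conversions Cygwin performs internally, and the helper
 <literal>can_encode</literal> as well as the sample names are invented for
 this sketch.</para>

 <screen><![CDATA[
```python
def can_encode(name, charset):
    """True if every character of the name exists in the given charset."""
    try:
        name.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


japanese_name = "\u65e5\u672c"  # two Japanese characters
telugu_name = "\u0c24"          # one Telugu character

for charset in ("iso-8859-1", "euc_jp", "utf-8"):
    print(charset, can_encode(japanese_name, charset),
          can_encode(telugu_name, charset))
```
]]></screen>

 <para>Only the UTF-8 row accepts both names, which is the reason for the
 recommendation above.</para>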

 </sect2>

 <sect2 id="setup-locale-console"><title>The Windows Console character set</title>

 <para>Sometimes the Windows console is used to run Cygwin applications.
 While terminal emulators like the Cygwin Terminal <command>mintty</command>
 or <command>xterm</command> have a distinct way to set the character set
 used for input and output, the Windows console has no such way, since it's
 not an application in its own right.</para>

 <para>This problem is solved in Cygwin as follows.  When a Cygwin
 process is started in a Windows console (either explicitly from cmd.exe,
 or implicitly by, for instance, running the
 <filename>C:\cygwin\Cygwin.bat</filename> batch file), the Console character
 set is determined by the setting of the aforementioned
 internationalization environment variables, the same way as described in
 <xref linkend="setup-locale-how"></xref>.  </para>

 <para>What is that good for?  Why not switch the console character set to
 match the application's requirements?  After all, the application knows whether
 it uses localization or not.  However, what if a non-localized application
 calls a remote application which itself is localized?  This can happen with
 <command>ssh</command> or <command>rlogin</command>.  Neither command has or
 needs localization, and neither ever calls
 <function>setlocale</function>.  Setting one of the internationalization
 environment variables to the same charset as the remote machine before
 starting <command>ssh</command> or <command>rlogin</command> fixes that
 problem.</para>

 </sect2>

 <sect2 id="setup-locale-problems"><title>Potential Problems when using Locales</title>

 <para>
 You can set the above internationalization variables not only when
 starting the first Cygwin process, but also in your Cygwin shell on the
 fly, even switch to yet another character set, and yet another.  In bash
 for instance:</para>

 <screen>
   <prompt>bash$</prompt> export LC_CTYPE="nl_BE.UTF-8"
 </screen>

 <para>However, here's a problem.  At the start of the first Cygwin process
 in a session, the Windows environment is converted from UTF-16 to UTF-8.
 The environment is another of the system objects stored in UTF-16 in
 Windows.</para>

 <para>As long as the environment only contains ASCII characters, this is
 no problem at all.  But if it contains native characters, and you're planning
 to use, say, GBK, the UTF-8 encoded environment will contain byte sequences
 that are invalid in the GBK charset.  This would be a particular problem in
 variables like <envar>PATH</envar>.  To circumvent the worst problems, Cygwin
 converts the <envar>PATH</envar> environment variable to the charset set in
 the environment, if it's different from the UTF-8 charset.</para>
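
 <para>The effect can be reproduced outside of Cygwin with a short Python
 sketch.  It encodes a value as UTF-8 and then reads the resulting bytes as if
 they were GBK, which is roughly what happens to an environment value when the
 charset is switched.  The function name and the sample paths are hypothetical,
 and Python's <literal>gbk</literal> codec stands in for Cygwin's own
 conversion.</para>

 <screen><![CDATA[
```python
def reinterpret_utf8_as(value, charset):
    """Encode text as UTF-8, then decode the bytes as if they were `charset`."""
    return value.encode("utf-8").decode(charset, errors="replace")


ascii_path = "C:\\cygwin\\bin"
native_path = "C:\\Users\\\u65e5\u672c\\bin"  # hypothetical non-ASCII path

# ASCII-only values survive; values with native characters do not.
print(reinterpret_utf8_as(ascii_path, "gbk") == ascii_path)
print(reinterpret_utf8_as(native_path, "gbk") == native_path)
```
]]></screen>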

 <note><para>Per POSIX, the name of an environment variable should only
 consist of valid ASCII characters, and only of uppercase letters, digits, and
 the underscore for maximum portability.</para></note>

 <para>Very old symbolic links may pose a problem when switching charsets on
 the fly.  A symbolic link contains the filename of the target file the
 symlink points to.  When a symlink was created with a version of Cygwin
 prior to Cygwin 1.7, the current ANSI or OEM character set was used to
 store the target filename, depending on the old <envar>CYGWIN</envar>
 environment variable setting <envar>codepage</envar> (see <xref
 linkend="cygwinenv-removed-options"></xref>).  If the target filename
 contains non-ASCII characters and you use a character set other than
 your default ANSI/OEM charset, the target filename of the symlink is now
 potentially an invalid character sequence in the new character set.
 This behaviour is not different from the behaviour in other Operating
 Systems.  Recreate the symlink if that happens to you.</para>

 </sect2>

 <sect2 id="setup-locale-charsetlist"><title>List of supported character sets</title>

 <para>Last but not least, here's the list of currently supported character
 sets.  The left-hand expression is the name of the charset, as you would use
 it in the internationalization environment variables as outlined above.
 Note that charset specifiers are case-insensitive.  <literal>EUCJP</literal>
 is equivalent to <literal>eucJP</literal> or <literal>eUcJp</literal>.
 Writing the charset in the exact case as given in the list below is a
 good convention, though.
 </para>

 <para>The right-hand side is the number of the equivalent Windows
 codepage as well as the Windows name of the codepage.  They are only
 noted here for reference.  Don't try to use the bare codepage number or
 the Windows name of the codepage as charset in locale specifiers, unless
 they happen to be identical to the left-hand side.  Especially in case
 of the "CPxxx" style charsets, always use them with the leading "CP".</para>

 <para>This works:</para>

 <screen>
   set LC_ALL=en_US.CP437
 </screen>

 <para>This does <emphasis>not</emphasis> work:</para>

 <screen>
   set LC_ALL=en_US.437
 </screen>

 <para>You can find a full list of Windows codepages on the Microsoft MSDN page
 <ulink url="http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx">Code Page Identifiers</ulink>.</para>

 <screen>
     Charset               Codepage
     -------------------   -------------------------------------------
     ASCII                 20127 (US_ASCII)

     CP437                   437 (OEM United States)
     CP720                   720 (DOS Arabic)
     CP737                   737 (OEM Greek)
     CP775                   775 (OEM Baltic)
     CP850                   850 (OEM Latin 1, Western European)
     CP852                   852 (OEM Latin 2, Central European)
     CP855                   855 (OEM Cyrillic)
     CP857                   857 (OEM Turkish)
     CP858                   858 (OEM Latin 1 + Euro Symbol)
     CP862                   862 (OEM Hebrew)
     CP866                   866 (OEM Russian)
     CP874                   874 (ANSI/OEM Thai)
     CP932                   932 (Shift_JIS, not exactly identical to SJIS)
     CP1125                 1125 (OEM Ukraine)
     CP1250                 1250 (ANSI Central European)
     CP1251                 1251 (ANSI Cyrillic)
     CP1252                 1252 (ANSI Latin 1, Western European)
     CP1253                 1253 (ANSI Greek)
     CP1254                 1254 (ANSI Turkish)
     CP1255                 1255 (ANSI Hebrew)
     CP1256                 1256 (ANSI Arabic)
     CP1257                 1257 (ANSI Baltic)
     CP1258                 1258 (ANSI/OEM Vietnamese)

     ISO-8859-1            28591 (ISO-8859-1)
     ISO-8859-2            28592 (ISO-8859-2)
     ISO-8859-3            28593 (ISO-8859-3)
     ISO-8859-4            28594 (ISO-8859-4)
     ISO-8859-5            28595 (ISO-8859-5)
     ISO-8859-6            28596 (ISO-8859-6)
     ISO-8859-7            28597 (ISO-8859-7)
     ISO-8859-8            28598 (ISO-8859-8)
     ISO-8859-9            28599 (ISO-8859-9)
     ISO-8859-10             -   (not available)
     ISO-8859-11             -   (not available)
     ISO-8859-13           28603 (ISO-8859-13)
     ISO-8859-14             -   (not available)
     ISO-8859-15           28605 (ISO-8859-15)
     ISO-8859-16             -   (not available)

     Big5                    950 (ANSI/OEM Traditional Chinese)
     EUCCN or euc-CN         936 (ANSI/OEM Simplified Chinese)
     EUCJP or euc-JP       20932 (EUC Japanese)
     EUCKR or euc-KR         949 (EUC Korean)
     GB2312                  936 (ANSI/OEM Simplified Chinese)
     GBK                     936 (ANSI/OEM Simplified Chinese)
     GEORGIAN-PS             -   (not available)
     KOI8-R                20866 (KOI8-R Russian Cyrillic)
     KOI8-U                21866 (KOI8-U Ukrainian Cyrillic)
     PT154                   -   (not available)
     SJIS                    -   (not available, almost, but not exactly CP932)
     TIS620 or TIS-620       874 (ANSI/OEM Thai)

     UTF-8 or utf8         65001 (UTF-8)
 </screen>
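
 <para>The naming rules for charsets (case-insensitive names, no bare codepage
 numbers) can be expressed as a simple check.  The following Python sketch uses
 a hand-picked subset of the list above, and the name
 <literal>is_supported_charset</literal> is made up for illustration; it says
 nothing about how Cygwin itself validates the charset.</para>

 <screen><![CDATA[
```python
# Assumption: a representative subset of the charset names listed above,
# stored in upper case so that lookups can ignore case.
SUPPORTED_CHARSETS = {
    "ASCII", "CP437", "CP850", "CP1252",
    "ISO-8859-1", "ISO-8859-15",
    "EUCJP", "EUC-JP", "GBK", "SJIS",
    "UTF-8", "UTF8",
}


def is_supported_charset(name):
    """Case-insensitive membership test against the subset above."""
    return name.upper() in SUPPORTED_CHARSETS


for name in ("CP437", "437", "eUcJp", "utf8"):
    print(name, is_supported_charset(name))
```
]]></screen>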

 </sect2>

 </sect1>