| http://www.linuxfromscratch.org/blfs/view/svn/introduction/locale-issues.html |
| |
| "The POSIX standard mandates that the filename encoding is the encoding implied by the current LC_CTYPE locale category." |
| |
| ------- |
| |
| http://mail.nl.linux.org/linux-utf8/2001-02/msg00103.html |
| |
| From: Markus Kuhn |
| |
| Tom Tromey wrote on 2001-02-05 00:36 UTC: |
| > Kai> IMAO, a *real* filesystem should use some encoding of ISO 10646 - |
| > Kai> UTF-8, UTF-16, or UTF-32 are all viable options. The same should |
| > Kai> be true for the kernel filename interfaces. |
| > |
| > I like this, but what should I do right now? |
| |
| The POSIX kernel file system interface is engraved into stone and |
| extremely unlikely to change. File names are arbitrary binary strings, |
| with only the '/' and '\0' bytes having any special semantics. You can |
| use arbitrary coded character sets on it as long as they do not |
| introduce '/' and '\0' bytes spuriously. Writers and readers have to |
| somehow agree on what encoding to use and the only really practical way |
| is to use the same encoding on all systems that share files. Eventually, |
| everyone will be using UTF-8 for file names on POSIX systems. Right now, |
| I would recommend users to use only ASCII for filenames, as this is |
| already UTF-8 and therefore simplifies migration. Using the ISO 8859, |
| JIS, etc. filenames should soon be considered deprecated practice. |
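| |
| Sketch, not from the mail: the condition that an encoding must not |
| introduce '/' or '\0' bytes spuriously can be checked mechanically. |
| The helper name introduces_special_bytes is hypothetical, and the codec |
| names are the ones Python's standard library happens to use. |

```python
def introduces_special_bytes(component: str, encoding: str) -> bool:
    """Return True if encoding `component` produces a '/' or NUL byte
    even though the text itself contains neither character."""
    if "/" in component or "\0" in component:
        raise ValueError("a file name component cannot contain '/' or NUL")
    raw = component.encode(encoding)
    return b"/" in raw or b"\0" in raw


if __name__ == "__main__":
    # U+012F is encoded as the bytes 2F 01 in UTF-16LE, so a '/' byte appears.
    for text in ("abc", "\u012f", "日本語"):
        for enc in ("utf-8", "shift_jis", "utf-16-le"):
            try:
                print(repr(text), enc, introduces_special_bytes(text, enc))
            except UnicodeEncodeError:
                print(repr(text), enc, "not representable")
```

| Under this check UTF-8 and Shift_JIS pass for the samples above, while |
| UTF-16 fails as soon as a code unit has 0x00 or 0x2F as one of its bytes. |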
| |
| > I work on libgcj, the runtime component of gcj, the Java front end to |
| > GCC. In libgcj of course we use UCS-2 everywhere, since that is what |
| > Java does. Currently, for Unixy systems, we assume that all file |
| > names are UTF-8. |
| |
| The best solution is to assume that the file names are in the |
| locale-specific multi-byte encoding. Simply use mbrtowc and wcrtomb to |
| convert between Unicode and the locale-dependent multi-byte encoding |
| used in file names and text files if the ISO C 99 symbol |
| __STDC_ISO_10646__ is defined (which guarantees that wchar_t = UCS). On |
| Linux, this has been the case since glibc 2.2. |
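| |
| Sketch, not from the mail: a rough Python analogue of the conversion |
| described above. It does not call mbrtowc or wcrtomb; it assumes that |
| sys.getfilesystemencoding() reflects the locale's multi-byte encoding, |
| which is the usual case on POSIX systems. The function names and the |
| surrogateescape error handler are choices made for this example. |

```python
import sys
from typing import Optional


def decode_filename(raw: bytes, encoding: Optional[str] = None) -> str:
    """File-name bytes -> Unicode text, using the locale-derived encoding
    unless an explicit encoding is given."""
    enc = encoding or sys.getfilesystemencoding()
    # surrogateescape keeps undecodable bytes as lone surrogates, so a name
    # written in some other encoding still survives a round trip.
    return raw.decode(enc, errors="surrogateescape")


def encode_filename(name: str, encoding: Optional[str] = None) -> bytes:
    """Unicode text -> file-name bytes, the inverse of decode_filename."""
    enc = encoding or sys.getfilesystemencoding()
    return name.encode(enc, errors="surrogateescape")


if __name__ == "__main__":
    print(sys.getfilesystemencoding())
    print(repr(decode_filename(b"caf\xe9", "iso-8859-1")))
    print(repr(decode_filename(b"caf\xc3\xa9", "utf-8")))
```

| The standard library's os.fsdecode and os.fsencode do the same job with |
| the interpreter's own settings. |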
| |
| > (Actually, we do something notably worse, which is |
| > assume that file names are Java-style UTF-8, with the weird encoding |
| > for \u0000.) |
| |
| \u0000 = NUL was never a character allowed in filenames under POSIX. |
| Raise an exception if someone tries to use it in a filename. Problem |
| solved. |
| |
| I never understood, why Java found it necessary to introduce two |
| distinct ASCII NUL characters. |
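| |
| Sketch, not from the mail: rejecting U+0000 up front, as suggested |
| above, plus a check showing that a strict UTF-8 decoder refuses the |
| two-byte form 0xC0 0x80 that Java's modified UTF-8 uses for NUL. Both |
| function names are hypothetical. |

```python
def checked_filename(name: str) -> str:
    """Return `name` unchanged, or raise if it contains U+0000."""
    if "\0" in name:
        raise ValueError("NUL (U+0000) is not allowed in a POSIX file name")
    return name


def is_strict_utf8(raw: bytes) -> bool:
    """True if `raw` is valid standard UTF-8 (overlong forms are invalid)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_strict_utf8(b"a\xc0\x80b"))
    try:
        checked_filename("a\0b")
    except ValueError as exc:
        print("rejected:", exc)
```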
| |
| ------- |
| |
| Interesting idea. Use iconv to create Shift_JIS or other multibyte character set (MBCS) test cases. |
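| |
| A minimal sketch of generating such test names. It uses Python's |
| built-in codecs as a stand-in for iconv, so the codec names (shift_jis, |
| euc_jp) are Python's and may differ from the names an iconv build |
| accepts. The function name make_test_names and the sample text are |
| made up for this example. |

```python
def make_test_names(text, encodings=("utf-8", "shift_jis", "euc_jp")):
    """Encode `text` in each encoding and return {encoding: raw bytes}.

    Encodings that cannot represent the text, or whose output contains a
    '/' or NUL byte, are skipped because the result could not be used as
    a POSIX file name component.
    """
    names = {}
    for enc in encodings:
        try:
            raw = text.encode(enc)
        except UnicodeEncodeError:
            continue  # text is not representable in this encoding
        if b"/" in raw or b"\0" in raw:
            continue  # unusable as a file name component
        names[enc] = raw
    return names


if __name__ == "__main__":
    for enc, raw in make_test_names("日本語").items():
        print(enc, raw.hex(" "))
```

| With the iconv command line the same bytes would come from something |
| like: printf '日本語' | iconv -f UTF-8 -t SHIFT_JIS (assuming a UTF-8 |
| terminal and an iconv that knows that encoding name). |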