| <html> |
| |
| <head> |
| <meta http-equiv="Content-Language" content="en-us"> |
| <meta name="GENERATOR" content="Microsoft FrontPage 5.0"> |
| <meta name="ProgId" content="FrontPage.Editor.Document"> |
| <meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> |
| <title>1.34 (Internationalization) Changes</title> |
| </head> |
| |
| <body bgcolor="#FFFFFF"> |
| |
| <h1>1.34 (Internationalization) Changes</h1> |
| <h2>Introduction</h2> |
| <p>This release is a major upgrade for the Filesystem Library, in preparation |
| for submission to the C++ Standards Committee. Features of this release |
| include:</p> |
| <ul> |
| <li><a href="#Internationalization">Internationalization</a>, provided by |
| class templates <i>basic_path</i>, <i>basic_filesystem_error</i>, <i> |
| basic_directory_iterator</i>, and <i>basic_directory_entry</i>.<br> |
| </li> |
| <li><a href="#Simplification">Simplification</a> of the path interface, |
| including elimination of distinction between native and generic formats, |
| and separation of name checking functionality from general path functionality. |
| Also simplification of <i>basic_filesystem_error</i>.<br> |
| </li> |
| <li><a href="#Rationalization">Rationalization</a> of predicate function |
| design, including the addition of several new functions.<br> |
| </li> |
| <li>Clearer specification by reference to [<a href="design.htm#POSIX-01">POSIX-01</a>], |
| the ISO/IEEE Single Unix Standard, with provisions for Windows and other |
| operating systems.<br> |
| </li> |
| <li><a href="#Preservation">Preservation</a> of existing user code whenever |
| possible.<br> |
| </li> |
| <li><a href="#More_efficient">More efficient operations</a> when iterating over directories.<br> |
| </li> |
| <li>A <a href="reference.html#recursive_directory_iterator">recursive |
| directory iterator</a> is now provided. </li> |
| </ul> |
| <p><a href="#Rationale">Rationale</a> for some of the changes is also provided.</p> |
| <h2><a name="Internationalization">Internationalization</a></h2> |
| <p>Class templates <i>basic_path</i>, <i>basic_filesystem_error</i>, and <i> |
| basic_directory_iterator</i> provide the basic mechanisms for |
| internationalization, in ways very similar to the C++ Standard Library's <i> |
| basic_string</i> and similar class templates. The following typedefs are also |
| provided:</p> |
| <blockquote> |
| <pre>typedef basic_path<std::string, ...> path; |
| typedef basic_path<std::wstring, ...> wpath; |
| |
| typedef basic_filesystem_error<path> filesystem_error; |
| typedef basic_filesystem_error<wpath> wfilesystem_error; |
| |
| typedef basic_directory_iterator<path> directory_iterator; |
| typedef basic_directory_iterator<wpath> wdirectory_iterator;</pre> |
| </blockquote> |
| <p>The string type used by Boost.Filesystem <i>basic_path</i> (std::string, |
| std::wstring, or whatever) is called the <i>internal</i> string type. The string |
| type used by the operating system for paths (often char*, sometimes wchar_t*) is |
| called the <i>external</i> string type. Conversion between internal and external |
| types is performed by path traits classes. The specific conversions for <i>path</i> |
| and <i>wpath</i> are implementation defined, with normative encouragement to use |
| the operating system's preferred file system encoding. For many modern POSIX-based |
| file systems the <i>wpath</i> external encoding is <a href="design.htm#Kuhn"> |
| UTF-8</a>, while for modern Windows file systems such as NTFS it is |
| <a href="http://en.wikipedia.org/wiki/UTF-16">UTF-16</a>.</p> |
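<p>The split between internal and external string types can be illustrated with a minimal sketch. Everything prefixed <code>toy_</code> is hypothetical and is not the library's interface. The wide traits class assumes that UTF-8 is the external encoding and that each wchar_t holds one whole code point, which does not hold for UTF-16 surrogate pairs.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical traits for narrow paths: internal and external types are both
// std::string, so the conversion is the identity.
struct toy_path_traits {
    typedef std::string internal_string_type;
    typedef std::string external_string_type;
    static external_string_type to_external(const internal_string_type& src) {
        return src;
    }
};

// Hypothetical traits for wide paths: encode the internal string as UTF-8.
// Assumption: every wchar_t holds one whole code point (no surrogate pairs).
struct toy_wpath_traits {
    typedef std::wstring internal_string_type;
    typedef std::string external_string_type;
    static external_string_type to_external(const internal_string_type& src) {
        external_string_type out;
        for (internal_string_type::size_type i = 0; i < src.size(); ++i) {
            const unsigned long cp = static_cast<unsigned long>(src[i]);
            assert(cp <= 0x10FFFFul);  // precondition: a valid code point
            if (cp < 0x80ul) {
                append(out, cp);
            } else if (cp < 0x800ul) {
                append(out, 0xC0ul | (cp >> 6));
                append(out, 0x80ul | (cp & 0x3Ful));
            } else if (cp < 0x10000ul) {
                append(out, 0xE0ul | (cp >> 12));
                append(out, 0x80ul | ((cp >> 6) & 0x3Ful));
                append(out, 0x80ul | (cp & 0x3Ful));
            } else {
                append(out, 0xF0ul | (cp >> 18));
                append(out, 0x80ul | ((cp >> 12) & 0x3Ful));
                append(out, 0x80ul | ((cp >> 6) & 0x3Ful));
                append(out, 0x80ul | (cp & 0x3Ful));
            }
        }
        return out;
    }
private:
    static void append(std::string& out, unsigned long byte) {
        out += static_cast<char>(static_cast<unsigned char>(byte));
    }
};

// Hypothetical path template: stores the internal string and asks the traits
// class for the external form only when the operating system needs it.
template <class String, class Traits>
class toy_basic_path {
public:
    typedef String string_type;
    explicit toy_basic_path(const String& s) : m_path(s) {}
    const String& string() const { return m_path; }
    typename Traits::external_string_type external_string() const {
        return Traits::to_external(m_path);
    }
private:
    String m_path;
};

typedef toy_basic_path<std::string, toy_path_traits> toy_path;
typedef toy_basic_path<std::wstring, toy_wpath_traits> toy_wpath;
```

<p>With these definitions, <code>toy_wpath(L"caf\u00E9").external_string()</code> yields the narrow string whose last two bytes are 0xC3 0xA9, while a <code>toy_path</code> passes its contents through unchanged.</p>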
| <p>The <a href="reference.html#Operations-functions">operational functions</a> in |
| <a href="../../../../boost/filesystem/operations.hpp">operations.hpp</a> are provided with overloads for |
| <i>path</i>, <i>wpath</i>, and user-defined <i>basic_path</i>'s. A |
| <a href="reference.html#Requirements-on-implementations">"do-the-right-thing" rule</a> |
| applies to implementations, ensuring that the correct overload will be chosen.</p> |
| <h2><a name="Simplification">Simplification</a> of path interface</h2> |
| <p>Prior versions of the library required users of class <i>path</i> to identify |
| the format (native or generic) and name error-checking policy, either via a |
| second constructor argument or via a default mechanism. That approach caused |
| complaints, particularly from users not needing the name checking features. The |
| interface has now been simplified:</p> |
| <ul> |
| <li>The distinction between native and generic formats has been eliminated. |
| See <a href="#distinction">rationale</a>. Two argument forms of path |
| constructors are now deprecated, with the second argument having no effect. |
| These constructors are only provided to ease the transition of existing code.<br> |
| </li> |
| <li>Path name checking functionality has been moved out of class path and into |
| separate free-functions. This still provides name checking for those who need |
| it, but with much less impact on those who don't need it.</li> |
| </ul> |
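<p>As a rough illustration of name checking as a free function, decoupled from the path class, the sketch below tests a single name against the POSIX portable filename character set. The function name <code>toy_portable_posix_name</code> is hypothetical; consult the reference documentation for the functions the library actually provides.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical free function: true if the name is non-empty and uses only
// characters from the POSIX portable filename character set
// (A-Z, a-z, 0-9, period, underscore, hyphen).
inline bool toy_portable_posix_name(const std::string& name) {
    static const std::string valid =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789._-";
    return !name.empty()
        && name.find_first_not_of(valid) == std::string::npos;
}

// Example use: code that never calls the function pays nothing for it.
inline void toy_name_check_examples() {
    assert(toy_portable_posix_name("report-2005.txt"));
    assert(!toy_portable_posix_name("my report.txt"));  // space is not portable
    assert(!toy_portable_posix_name(""));               // empty names are rejected
}
```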
| <p>Additionally, |
| <a href="reference.html#Class-template-basic_filesystem_error">basic_filesystem_error</a> has been put |
| on a diet and generally simplified.</p> |
| <p>Error codes have been moved to a separate library, |
| <a href="../../../system/doc/index.html">Boost.System</a>.</p> |
| <p><code>"//:"</code> has been introduced as a path escape prefix to identify |
| native paths. Rationale: this simplifies the <i>basic_path</i> constructor interfaces |
| and eases use on platforms needing explicit native format identification.</p> |
| <h2><a name="Rationalization">Rationalization</a> of predicate functions</h2> |
| <p>In discussions and bug reports on the Boost developers mailing list, it |
| became obvious that Boost.Filesystem's exists(), symbolic_link_exists(), and |
| is_directory() predicate functions were poorly specified. There were suggestions |
| to add an is_accessible() function, but Peter Dimov argued that this amounted to |
| papering over the lack of a clear specification and would likely lead to future |
| problems.</p> |
| <p>Peter suggested that an interesting way to analyze the problem was to ask |
| what the expectations were for true and false values of the various predicates. |
| See the <a href="#table">table</a> below.</p> |
| <h3><a name="status">status</a>()</h3> |
| <p>As part of the predicate discussions, particularly with Rob Stewart, it |
| became obvious that sometimes applications need access to raw status information |
| without any possibility of an exception being thrown. The |
| <a href="reference.html#Status-functions">status()</a> function was added to meet this |
| need. It also proved clearer to specify the semantics of predicate functions in |
| terms of status().</p> |
| <h3><a name="is_file">is_file</a>()</h3> |
| <p>About the same time, Jeff Garland suggested that an |
| <a href="reference.html#Predicate-functions">is_file()</a> predicate would |
| complement <a href="reference.html#Predicate-functions">is_directory()</a>. In working on the analysis below, it became obvious |
| that the expectations for is_file() were different from the expectations for !is_directory(), |
| so is_file() was added. </p> |
| <h3><a name="is_other">is_other</a>()</h3> |
| <p>On some operating systems, it is possible to have a directory entry which is |
| not for either a directory or a file. The |
| <a href="reference.html#Predicate-functions">is_other()</a> |
| function identifies such cases.</p> |
| <h3>Should predicates throw on errors?</h3> |
| <p>Some conditions reported by operating systems as errors (see |
| <a href="#Footnote">footnote</a>) clearly simply indicate that the predicate is |
| false, rather than indicating serious failure. But other errors represent |
| serious hardware or network problems, or permissions problems.</p> |
| <p>Some people, particularly Rob Stewart, argue that in a function like |
| <a href="reference.html#Predicate-functions">is_directory()</a>, any error should simply cause the function to return false. If |
| there is actually an underlying problem, it will be detected in due course when |
| a directory_iterator or fstream operation is attempted.</p> |
| <p>That view was rejected because of the following considerations:</p> |
| <ul> |
| <li>As a general principle, the earlier errors can be reported, the better. |
| The rationale is that it is often much cheaper to fix errors sooner rather |
| than later. I've also had a lot of negative experiences where failure to |
| detect errors early caused a lot of pain and unhappy customers. Some of these |
| were directly caused by ignoring error returns from file system operations.<br> |
| </li> |
| <li>Analysis of existing programs indicated that as many as 30% of the uses of |
| a predicate were not followed by directory_iterator or fstream operations on |
| the path in question. Instead, the applications performed reporting or |
| fall-back operations that would not fail, and thus were either misleading or |
| completely wrong if the <i>false</i> return value was in fact caused by |
| hardware or network failure, or permissions problems.</li> |
| </ul> |
| <p>However, the discussion did identify that there are valid cases where |
| non-throwing behavior is a requirement, and a programmer may prefer to deal with |
| file or directory attributes and errors at a very low, bit-mask, level. Function <a href="#status">status()</a> |
| was proposed to meet those needs.</p> |
| <h3><a name="Expectations">Expectations</a> <a name="table">table</a></h3> |
| <p>In the table below, <i>p</i> is a non-empty path.</p> |
| <p>Unless otherwise specified, all functions throw on hardware or general |
| failure errors, permission or access errors, symbolic link loop errors, and |
| invalid path errors. If an O/S fails to distinguish between error types, |
| predicate operations return false on such ambiguous errors.</p> |
| <p><i><b>Expectations</b></i> identify operations that are expected to succeed |
| or fail, assuming no hardware, permission, or access right errors, and no race |
| conditions.</p> |
| <table border="1" cellpadding="5" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%"> |
| <tr> |
| <td width="22%" align="center"><b><i>Expression</i></b></td> |
| <td width="48%" align="center"><b><i>Expectations</i></b></td> |
| <td width="108%" align="center"><b><i>Semantics</i></b></td> |
| </tr> |
| <tr> |
| <td width="22%">is_directory(p)</td> |
| <td width="48%">Returns true if p is found and is a directory, else false.<br> |
| If true, then directory_iterator(p) would succeed.<br> |
| If false, then directory_iterator(p) would fail.</td> |
| <td width="108%">Throws: if <a href="#status">status()</a> & error_flag<br> |
| Returns: status() & directory_flag</td> |
| </tr> |
| <tr> |
| <td width="22%">is_file(p)</td> |
| <td width="48%">Returns true if p is found and is not a directory, else |
| false.<br> |
| If true, then ifstream(p) would succeed.<br> |
| False, however, does not imply ifstream(p) would fail (because some |
| operating systems allow directories to be opened as files, but stat() does |
| not set the "regular file" flag for them).</td> |
| <td width="108%">Throws: if status() & error_flag<br> |
| Returns: status() & file_flag</td> |
| </tr> |
| <tr> |
| <td width="22%">exists(p) </td> |
| <td width="48%">Returns is_directory(p) || is_file(p) || is_other(p)</td> |
| <td width="108%">Throws: if status() & error_flag<br> |
| Returns: status() & (directory_flag|file_flag|other_flag)</td> |
| </tr> |
| <tr> |
| <td width="22%">is_symlink(p)</td> |
| <td width="48%">Returns true if p is found by shallow (non-transitive) |
| search, and is a symbolic link, else false.<br> |
| If true, and p points to q, then for any filesystem function f except those |
| specified as working shallowly on symlinks themselves, f(p) calls f(q), and |
| returns any value returned by f(q).</td> |
| <td width="108%">Throws: if <a href="#status">symlink_status</a>() & |
| error_flag<br> |
| Returns: symlink_status() & symlink_flag</td> |
| </tr> |
| <tr> |
| <td width="22%">!exists(p) && ((p.has_branch_path() && exists(p.branch_path())) |
| || (!p.has_branch_path() && !p.has_root_path()))<br> |
| <i>In other words, if the path does not exist, and (the branch does exist, |
| or (there is no branch and no root)).</i></td> |
| <td width="48%">If true, create_directory(p) would succeed.<br> |
| If true, ofstream(p) would succeed.<br> |
| </td> |
| <td width="108%"> </td> |
| </tr> |
| <tr> |
| <td width="22%">directory_iterator it(p)</td> |
| <td width="48%">If it != directory_iterator(), assert(exists(*it)||is_symlink(*it)). |
| Note: exists(*it) may throw, and likewise status(*it) may return error_flag |
| - there is no guarantee of accessibility.</td> |
| <td width="108%"> </td> |
| </tr> |
| </table> |
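<p>The Semantics column above can be sketched with a small in-memory model. The flag names follow the table, but their values, the map standing in for a file system, and every name prefixed <code>toy_</code> are assumptions made for illustration only. The point is the shape: a status query that reports problems through flags, and predicates that throw when <code>error_flag</code> is set.</p>

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Flag names follow the table; the numeric values are arbitrary assumptions.
enum toy_flags {
    error_flag = 1, directory_flag = 2, file_flag = 4, other_flag = 8
};

// Hypothetical stand-in for a file system: path string -> status flags.
typedef std::map<std::string, unsigned> toy_file_system;

// Reports problems through the returned flags rather than by throwing.
// An unknown path yields 0 (no flags set).
inline unsigned toy_status(const toy_file_system& fs, const std::string& p) {
    toy_file_system::const_iterator it = fs.find(p);
    return it == fs.end() ? 0u : it->second;
}

// Shared "Throws: if status() & error_flag" step used by every predicate.
inline unsigned toy_checked_status(const toy_file_system& fs,
                                   const std::string& p) {
    const unsigned s = toy_status(fs, p);
    if (s & error_flag) throw std::runtime_error("status error: " + p);
    return s;
}

inline bool toy_is_directory(const toy_file_system& fs, const std::string& p) {
    return (toy_checked_status(fs, p) & directory_flag) != 0;
}

inline bool toy_is_file(const toy_file_system& fs, const std::string& p) {
    return (toy_checked_status(fs, p) & file_flag) != 0;
}

inline bool toy_exists(const toy_file_system& fs, const std::string& p) {
    return (toy_checked_status(fs, p)
            & (directory_flag | file_flag | other_flag)) != 0;
}

// Example use: a missing path is simply false, an error is an exception.
inline void toy_predicate_examples() {
    toy_file_system fs;
    fs["/home"] = directory_flag;
    fs["/mnt/dead"] = error_flag;
    assert(toy_is_directory(fs, "/home"));
    assert(!toy_exists(fs, "/missing"));
    bool threw = false;
    try { toy_exists(fs, "/mnt/dead"); }
    catch (const std::runtime_error&) { threw = true; }
    assert(threw);
}
```

<p>A caller that cannot tolerate exceptions would call <code>toy_status</code> directly and inspect the flags itself.</p>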
| <h3><a name="Conclusion">Conclusion</a></h3> |
| <p>Predicate operations is_directory(), is_file(), is_symlink(), and exists() |
| with the indicated semantics form a self-consistent set that meets expectations.</p> |
| <h2><a name="Preservation">Preservation</a> of existing user code</h2> |
| <p>Although the change to a template based approach required a complete overhaul |
| of the implementation code, the interface as used by existing applications is mostly unchanged. |
| Conversion problems which would |
| otherwise affect user code have been reduced by providing deprecated |
| functions to ease transition. The deprecated functions are:</p> |
| <blockquote> |
| <pre>// class basic_path - 2nd constructor argument ignored: |
| basic_path( const string_type & str, name_check ); |
| basic_path( const typename string_type::value_type * s, name_check ); |
| |
| // class basic_path - old names provided for renamed functions: |
| string_type native_file_string() const; |
| string_type native_directory_string() const; |
| |
| // class basic_path - now defined such that these no longer have any real effect: |
| static bool default_name_check_writable() { return false; } |
| static void default_name_check( name_check ) {} |
| static name_check default_name_check() { return 0; } |
| |
| // non-deducible operations functions assume class path |
| inline path current_path() |
| inline const path & initial_path() |
| |
| // the new basic_directory_entry provides leaf() |
| // to cover the common existing use case itr->leaf() |
| typename Path::string_type leaf() const;</pre> |
| </blockquote> |
| <p>If you do not want the deprecated functions to be included, define the macro BOOST_FILESYSTEM_NO_DEPRECATED.</p> |
| <p>The greatest impact on existing code is the change of directory iterator |
| value type from <code>path</code> to <code>directory_entry</code>. To ease the |
| most common directory iterator use case, <code>basic_directory_entry</code> |
| provides an automatic conversion to <code>basic_path</code>, and this also |
| serves to prevent breakage of a lot of existing code. See the |
| <a href="#More_efficient">next section</a> for discussion of rationale.</p> |
| <blockquote> |
| <pre>// the new basic_directory_entry provides: |
| operator const path_type &() const;</pre> |
| </blockquote> |
| <h2><a name="More_efficient">More efficient</a> operations when iterating over |
| directories</h2> |
| <p>Several common real-world operating systems (BSD derivatives, Linux, Windows) |
| provide status information during directory iteration. Caching of this status |
| information results in three to six times faster operation for typical predicate |
| operations. (For a directory containing 15,047 files, iteration in 1 second vs 6 |
| seconds on a freshly booted system, and 0.3 seconds vs 0.9 seconds after prior use of |
| the directory.)</p> |
| <p>The efficiency gains from caching such status information were considered too |
| significant to ignore. Because the possibility of race-conditions differs |
| depending on whether the cached information is used or an actual system call is |
| performed, it was considered necessary to provide explicit functions utilizing |
| the cached information, rather than implicitly using the cache behind the |
| scenes.</p> |
| <p>Three options were explored for exposing the cached status information, with |
| full implementations of each. After initial implementation of option 1 exposed |
| the problems noted below, option 2 was tested as a possible engineering |
| tradeoff. Option 3 |
| was finally chosen as the cleanest design.</p> |
| <table border="1" cellpadding="5" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%"> |
| <tr> |
| <td width="8%" align="center"><b><i>Option</i></b></td> |
| <td width="25%" align="center"><i><b>How cache accessed</b></i></td> |
| <td width="94%" align="center"><i><b>Pros and Cons</b></i></td> |
| </tr> |
| <tr> |
| <td width="8%" valign="top" align="center"><i><b>1</b></i></td> |
| <td width="25%" valign="top">Predicate function overloads<br> |
| (basic_directory_iterator value_type is path)</td> |
| <td width="94%"> |
| <ul> |
| <li>Very questionable design (friendship abuse, overload abuse, etc)</li> |
| <li>User cannot reuse cache</li> |
| <li>Readability problem; easy to miss difference between f(*it) and f(it)</li> |
| <li>Write-ability problem (error prone?)</li> |
| <li>Most common iterator use is brief: *it</li> |
| <li>Preserves existing code</li> |
| </ul> |
| </td> |
| </tr> |
| <tr> |
| <td width="8%" valign="top" align="center"><b><i>2</i></b></td> |
| <td width="25%" valign="top">Predicate member functions of basic_directory_<span style="background-color: #FFFF00">iterator</span><br> |
| (basic_directory_iterator value_type is path)</td> |
| <td width="94%"> |
| <ul> |
| <li>Somewhat cleaner design (although adding member functions to an iterator is unusual)</li> |
| <li>User cannot reuse cache</li> |
| <li>Readability and write-ability are OK: f(*it) and it.f() are sufficiently |
| different</li> |
| <li>Most common iterator use is brief: *it</li> |
| <li>Preserves existing code</li> |
| </ul> |
| </td> |
| </tr> |
| <tr> |
| <td width="8%" valign="top" align="center"><b><i>3</i></b></td> |
| <td width="25%" valign="top">Predicate member functions of basic_directory_<span style="background-color: #FFFF00">entry</span><br> |
| (basic_directory_iterator value_type is basic_directory_entry)<br> |
| </td> |
| <td width="94%"> |
| <ul> |
| <li>Cleanest design.</li> |
| <li>User can reuse cache.</li> |
| <li>Readability and write-ability are OK: f(*it) and it->f() are sufficiently |
| different.</li> |
| <li>Most common iterator use is longer: it->path(), but by providing |
| "operator const basic_path &" it is still possible to write a bare *it.</li> |
| <li>Breaks some existing code. The "operator const basic_path &" |
| conversion eliminates breakage of the most common use case, while |
| providing a (deprecated) leaf() prevents breakage of the second most |
| common use case.</li> |
| </ul> |
| </td> |
| </tr> |
| </table> |
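<p>Option 3 can be sketched as follows. This is a simplified model, not the library's implementation: the counter, the simulated operating system query, and all names prefixed <code>toy_</code> are hypothetical. It shows the two properties discussed above: the member predicate answers from the cache without a system call, and the conversion to a path keeps code written as <code>f(*it)</code> compiling.</p>

```cpp
#include <cassert>
#include <string>

int toy_system_calls = 0;  // counts simulated operating system queries

enum toy_kind { toy_regular, toy_directory };

// Hypothetical stand-in for a real status query. For this model only, a
// trailing '/' marks a directory.
toy_kind toy_query_os(const std::string& p) {
    ++toy_system_calls;
    return (!p.empty() && p[p.size() - 1] == '/') ? toy_directory : toy_regular;
}

// Free predicate taking a path: always asks the (simulated) operating system.
bool toy_path_is_directory(const std::string& p) {
    return toy_query_os(p) == toy_directory;
}

// Hypothetical directory entry: a path plus the status captured while the
// directory was being iterated.
class toy_directory_entry {
public:
    toy_directory_entry(const std::string& p, toy_kind cached)
        : m_path(p), m_kind(cached) {}
    const std::string& path() const { return m_path; }
    // Conversion that lets existing f(*it) style code keep working.
    operator const std::string&() const { return m_path; }
    // Member predicate: explicit use of the cache, no system call.
    bool is_directory() const { return m_kind == toy_directory; }
private:
    std::string m_path;
    toy_kind m_kind;
};

// Example use: the cached query is free, the converted one costs a call.
void toy_entry_example() {
    const toy_directory_entry e("docs/", toy_directory);
    const int before = toy_system_calls;
    assert(e.is_directory());
    assert(toy_system_calls == before);
    assert(toy_path_is_directory(e));
    assert(toy_system_calls == before + 1);
}
```

<p>Because the two spellings differ visibly, a reader can tell at the call site whether a result may be stale (cached) or fresh (queried), which matters when reasoning about race conditions.</p>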
| <h2><a name="Rationale">Rationale</a></h2> |
| <h3>Elimination of the native versus generic <a name="distinction">distinction</a></h3> |
| <p>Elimination of user confusion and general design simplification was the |
| original motivation for elimination of the distinction between native and |
| generic paths.</p> |
| <p>During design work, a further technical argument was discovered. Consider the |
| path <code>"c:foo/bar"</code>. On many POSIX systems, <code>"c:foo"</code> is a |
| valid directory name, so we have a two element path and there is no issue of |
| native versus generic format. On Windows systems, however, <code>"c:"</code> is a |
| drive specification, so we have a three element path. All calls to the operating |
| system will result in <code>"c:"</code> being considered a drive specification; |
| there is no way that fact-of-life can be changed by claiming the format is |
| generic. The native versus generic distinction is thus useless and misleading |
| for POSIX, Windows, and probably most other operating systems.</p> |
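<p>The <code>"c:foo/bar"</code> example can be made concrete with a small sketch. The function <code>toy_elements</code> is hypothetical and deliberately simplified: it splits only on <code>'/'</code>, ignores root directories, and treats a leading letter followed by a colon as a drive only when asked to apply Windows rules.</p>

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: decompose a path string into its elements.
// With windows_rules set, a leading "X:" is peeled off as its own element.
std::vector<std::string> toy_elements(const std::string& p, bool windows_rules) {
    std::vector<std::string> out;
    std::string::size_type pos = 0;
    if (windows_rules && p.size() >= 2 && p[1] == ':'
        && std::isalpha(static_cast<unsigned char>(p[0]))) {
        out.push_back(p.substr(0, 2));  // drive specification
        pos = 2;
    }
    while (pos < p.size()) {
        std::string::size_type slash = p.find('/', pos);
        if (slash == std::string::npos) slash = p.size();
        if (slash > pos) out.push_back(p.substr(pos, slash - pos));
        pos = slash + 1;
    }
    return out;
}

// Example use: the same string has two elements under POSIX rules and
// three under Windows rules.
void toy_elements_example() {
    const std::vector<std::string> posix = toy_elements("c:foo/bar", false);
    assert(posix.size() == 2 && posix[0] == "c:foo" && posix[1] == "bar");
    const std::vector<std::string> win = toy_elements("c:foo/bar", true);
    assert(win.size() == 3 && win[0] == "c:" && win[1] == "foo" && win[2] == "bar");
}
```

<p>The decomposition is decided by the operating system's rules alone; no "generic" label on the string changes which of the two results applies.</p>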
| <p>If paths for a particular operating system did require a distinction be made, |
| it could be done by requiring that native paths be prefixed with some unique |
| implementation-defined identification. For example, <code>"native-path:"</code>. |
| This would only be required for operating systems where (1) the distinction |
| mattered, and (2) there was no lexical way to distinguish the two forms. For |
| example, a native operating system that used the same syntax as the Filesystem |
| Library's generic POSIX-like format, but processed the elements right-to-left |
| instead of left-to-right.</p> |
| <h3>Preservation of <a name="existing-code">existing code</a></h3> |
| <p>Allowing existing user code to continue to work with the updated version of |
| the library has obvious benefits in terms of preserving the effort users have |
| applied to both learning the library and writing code which uses the library.</p> |
| <p>There is an additional motivation; other than the name checking portion of |
| class path, the existing interface has proven to be useful and robust, so |
| there is no reason to fiddle with it.</p> |
| <h3><a name="Single_path_design">Single path design</a></h3> |
| <p>During preliminary internationalization discussion on the Boost developers |
| mailing list, a design was considered for a single path class which could hold either |
| narrow or wide character based paths. That design was rejected because:</p> |
| <ul> |
| <li>The design was, for many applications, an over-generalization with runtime |
| memory and speed costs which would have to be paid for even when not needed.<br> |
| </li> |
| <li>There was concern that the design would be confusing to users, given that |
| the standard library already uses single-value-type strings, rather than |
| strings which morph value types as needed.<br> |
| </li> |
| <li>There were technical issues with conversions when a narrow path was |
| appended to a wide path, and vice versa. The concern was that double |
| conversions could cause incorrect results, that conversions best left to the |
| operating system would be performed, and that the technical complexity was too |
| great in relation to perceived benefits. User-defined types would only make |
| the problem worse.<br> |
| </li> |
| </ul> |
| <h3>No versions of <a href="reference.html#Status-functions">status()</a> which throw exceptions on |
| errors</h3> |
| <p>The rationale for not including versions of status() |
| which throw exceptions on errors is that (1) the primary purpose of this |
| function is to perform queries at a very low-level, where exceptions are usually |
| unwanted, and (2) exceptions on errors are already provided by the predicate |
| functions. There would be little or no efficiency gain from providing a throwing |
| version of status().</p> |
| <h3>Symlink identifying version of <a href="reference.html#Status-functions">status()</a> function</h3> |
| <p>A symlink identifying version of the status() function is distinguished by a |
| second argument. Often separately named functions are more appropriate than |
| overloading when behavior |
| differs, which is the case here, while overloads are more appropriate when |
| behavior is the same but argument types differ (Iain Hanson). Overloading was |
| chosen in this particular case because of a subjective judgment that a single |
| function name with an optional "symlink" second argument produced more |
| understandable code. The original implementation of the function used the name "symlink_status", |
| but that just didn't read right in real code.</p> |
| <h3>POSIX wpath_traits defaults to locale(""), but allows imbuing of locale</h3> |
| <p>Vladimir Prus pointed out that for Linux (and presumably other POSIX |
| operating systems) that need to convert wide character paths to narrow |
| characters, the default conversion should not depend on the operating system |
| alone, but on the std::locale("") default. For example, the usual encoding |
| for Russian on Linux (and Russian web sites) is KOI8-R (RFC1489). The ability to safely specify a different locale |
| is also provided, to meet unforeseen needs.</p> |
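<p>A locale-driven wide to narrow conversion of the kind described above can be sketched with the standard <code>std::codecvt</code> facet. This is an illustration of the technique under stated assumptions, not the library's traits implementation; <code>toy_to_external</code> and <code>toy_default_locale</code> are hypothetical names, and the fallback to the classic locale is a choice made for this sketch.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a wide string to the narrow encoding of the
// given locale using its codecvt facet.
std::string toy_to_external(const std::wstring& src, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_type;
    if (src.empty()) return std::string();
    const cvt_type& cvt = std::use_facet<cvt_type>(loc);
    std::mbstate_t state = std::mbstate_t();
    // max_length() bounds the narrow bytes produced per wide character.
    std::string out(src.size() * static_cast<std::size_t>(cvt.max_length()), '\0');
    const wchar_t* from_next = 0;
    char* to_next = 0;
    const std::codecvt_base::result r =
        cvt.out(state, src.data(), src.data() + src.size(), from_next,
                &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("wide to narrow path conversion failed");
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}

// Hypothetical helper: the user's preferred locale, as named by the
// environment. std::locale("") throws std::runtime_error if that name is
// not recognized, so this sketch falls back to the classic "C" locale.
std::locale toy_default_locale() {
    try {
        return std::locale("");
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Example use with the classic locale, where plain ASCII converts unchanged.
void toy_locale_example() {
    assert(toy_to_external(L"abc", std::locale::classic()) == "abc");
}
```

<p>Passing a different <code>std::locale</code> object as the second argument corresponds to imbuing a non-default locale.</p>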
| <hr> |
| <p>Revised |
| <!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%d %B, %Y" startspan -->18 March, 2008<!--webbot bot="Timestamp" endspan i-checksum="29005" --></p> |
| <p>© Copyright Beman Dawes, 2005</p> |
| <p>Distributed under the Boost Software License, Version 1.0. |
| (See accompanying file <a href="../../../../LICENSE_1_0.txt">LICENSE_1_0.txt</a> or |
| copy at <a href="http://www.boost.org/LICENSE_1_0.txt">www.boost.org/LICENSE_1_0.txt</a>)</p> |
| |
| </body> |
| |
| </html> |