blob: 1305493d8978375d2e0e4ab34e436f0e1a642945 [file] [log] [blame]
<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta name="GENERATOR" content="Microsoft FrontPage 5.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>1.34 (Internationalization) Changes</title>
</head>
<body bgcolor="#FFFFFF">
<h1>1.34 (Internationalization) Changes</h1>
<h2>Introduction</h2>
<p>This release is a major upgrade for the Filesystem Library, in preparation
for submission to the C++ Standards Committee. Features of this release
include:</p>
<ul>
<li><a href="#Internationalization">Internationalization</a>, provided by
class templates <i>basic_path</i>, <i>basic_filesystem_error</i>, <i>
basic_directory_iterator</i>, and <i>basic_directory_entry</i>.<br>
&nbsp;</li>
<li><a href="#Simplification">Simplification</a> of the path interface,
including elimination of distinction between native and generic formats,
and separation of name checking functionality from general path functionality.
Also simplification of <i>basic_filesystem_error</i>.<br>
&nbsp;</li>
<li><a href="#Rationalization">Rationalization</a> of predicate function
design, including the addition of several new functions.<br>
&nbsp;</li>
<li>Clearer specification by reference to [<a href="design.htm#POSIX-01">POSIX-01</a>],
the ISO/IEEE Single Unix Standard, with provisions for Windows and other
operating systems.<br>
&nbsp;</li>
<li><a href="#Preservation">Preservation</a> of existing user code whenever
possible.<br>
&nbsp;</li>
<li><a href="#More_efficient">More efficient operations</a> when iterating over directories.<br>
&nbsp;</li>
<li>A <a href="reference.html#recursive_directory_iterator">recursive
directory iterator</a> is now provided. </li>
</ul>
<p><a href="#Rationale">Rationale</a> for some of the changes is also provided.</p>
<h2><a name="Internationalization">Internationalization</a></h2>
<p>Cass templates <i>basic_path</i>, <i>basic_filesystem_error</i>, and <i>
basic_directory_iterator</i> provide the basic mechanisms for
internationalization, in ways very similar to the C++ Standard Library's <i>
basic_string</i> and similar class templates. The following typedefs are also
provided:</p>
<blockquote>
<pre>typedef basic_path&lt;std::string, ...&gt; path;
typedef basic_path&lt;std::wstring, ...&gt; wpath;
typedef basic_filesystem_error&lt;path&gt; filesystem_error;
typedef basic_filesystem_error&lt;wpath&gt; wfilesystem_error;
typedef basic_directory_iterator&lt;path&gt; directory_iterator;
typedef basic_directory_iterator&lt;wpath&gt; wdirectory_iterator;</pre>
</blockquote>
<p>The string type used by Boost.Filesystem <i>basic_path</i> (std::string,
std::wstring, or whatever) is called the <i>internal</i> string type. The string
type used by the operating system for paths (often char*, sometimes wchar_t*) is
called the <i>external</i> string type. Conversion between internal and external
types is performed by path traits classes. The specific conversions for <i>path</i>
and <i>wpath</i> is implementation defined, with normative encouragement to use
the operating system's preferred file system encoding. For many modern POSIX-based
file systems the <i>wpath</i> external encoding is <a href="design.htm#Kuhn">
UTF-8</a>, while for modern Windows file systems such as NTFS it is
<a href="http://en.wikipedia.org/wiki/UTF-16">UTF-16</a>.</p>
<p>The <a href="reference.html#Operations-functions">operational functions</a> in
<a href="../../../../boost/filesystem/operations.hpp">operations.hpp</a> are provided with overloads for
<i>path</i>, <i>wpath</i>, and user-defined <i>basic_path</i>'s. A
<a href="reference.html#Requirements-on-implementations">&quot;do-the-right-thing&quot; rule</a>
applies to implementations, ensuring that the correct overload will be chosen.</p>
<h2><a name="Simplification">Simplification</a> of path interface</h2>
<p>Prior versions of the library required users of class <i>path</i> to identify
the format (native or generic) and name error-checking policy, either via a
second constructor argument or via a default mechanism. That approach caused
complaints, particularly from users not needing the name checking features. The
interface has now been simplified:</p>
<ul>
<li>The distinction between native and generic formats has been eliminated.
See <a href="#distinction">rationale</a>. Two argument forms of path
constructors are now deprecated, with the second argument having no effect.
These constructors are only provided to ease the transition of existing code.<br>
&nbsp;</li>
<li>Path name checking functionality has been moved out of class path and into
separate free-functions. This still provides name checking for those who need
it, but with much less impact on those who don't need it.</li>
</ul>
<p>Additionally,
<a href="reference.html#Class-template-basic_filesystem_error">basic_filesystem_error</a> has been put
on a diet and generally simplified.</p>
<p>Error codes have been moved to a separate library,
<a href="../../../system/doc/index.html">Boost.System</a>.</p>
<p><code>&quot;//:&quot;</code> has been introduced as a path escape prefix to identify
native paths. Rationale: simplifies basic_path constructor interfaces, easier
use for platforms needing explicit native format identification.</p>
<h2><a name="Rationalization">Rationalization</a> of predicate functions</h2>
<p>In discussions and bug reports on the Boost developers mailing list, it
became obvious that Boost.Filesystem's exists(), symbolic_link_exists(), and
is_directory() predicate functions were poorly specified. There were suggestions
to add an is_accessible() function, but Peter Dimov argued that this amounted to
papering over the lack of a clear specification and would likely lead to future
problems.</p>
<p>Peter suggested that an interesting way to analyze the problem was to ask
what the expectations were for true and false values of the various predicates.
See the <a href="#table">table</a> below.</p>
<h3>status()</h3>
<p>As part of the predicate discussions, particularly with Rob Stewart, it
became obvious that sometimes applications need access to raw status information
without any possibility of an exception being thrown. The
<a href="reference.html#Status-functions">status()</a> function was added to meet this
need. It also proved clearer to specify the semantics of predicate functions in
terms of status().</p>
<h3><a name="is_file">is_file</a>()</h3>
<p>About the same time, Jeff Garland suggested that an
<a href="reference.html#Predicate-functions">is_file()</a> predicate would
compliment <a href="reference.html#Predicate-functions">is_directory()</a>. In working on the analysis below, it became obvious
that the expectations for is_file() were different from the expectations for !is_directory(),
so is_file() was added. </p>
<h3><a name="is_other">is_other</a>()</h3>
<p>On some operating systems, it is possible to have a directory entry which is
not for either a directory or a file. The
<a href="reference.html#Predicate-functions">is_other()</a>
function identifies such cases.</p>
<h3>Should predicates throw on errors?</h3>
<p>Some conditions reported by operating systems as errors (see
<a href="#Footnote">footnote</a>) clearly simply indicate that the predicate is
false, rather than indicating serious failure. But other errors represent
serious hardware or network problems, or permissions problems.</p>
<p>Some people, particularly Rob Stewart, argue that in a function like
<a href="reference.html#Predicate-functions">is_directory()</a>, any error should simply cause the function to return false. If
there is actually an underlying problem, it will be detected it due course when
a directory_iterator or fstream operation is attempted.</p>
<p>That view is was rejected because of the following considerations:</p>
<ul>
<li>As a general principle, the earlier errors can be reported, the better.
The rationale being that it is often much cheaper to fix errors sooner rather
than later. I've also had a lot of negative experiences where failure to
detect errors early caused a lot of pain and unhappy customers. Some of these
were directly caused by ignoring error returns from file system operations.<br>
&nbsp;</li>
<li>Analysis of existing programs indicated that as much as 30% of the use of
a predicate was not followed by directory_iterator or fstream operations on
the path in question. Instead, the applications performed reporting or
fall-back operations that would not fail, and thus were either misleading or
completely wrong if the <i>false</i> return value was in fact caused by
hardware or network failure, or permissions problems.</li>
</ul>
<p>However, the discussion did identify that there are valid cases where
non-throwing behavior is a requirement, and a programmer may prefer to deal with
file or directory attributes and errors at a very low, bit-mask, level. Function <a href="#status">status()</a>
was proposed to meet those needs.</p>
<h3><a name="Expectations">Expectations</a> <a name="table">table</a></h3>
<p>In the table below, <i>p</i> is a non-empty path.</p>
<p>Unless otherwise specified, all functions throw on hardware or general
failure errors, permission or access errors, symbolic link loop errors, and
invalid path errors. If an O/S fails to distinguish between error types,
predicate operations return false on such ambiguous errors.</p>
<p><i><b>Expectations</b></i> identify operations that are expected to succeed
or fail, assuming no hardware, permission, or access right errors, and no race
conditions.</p>
<table border="1" cellpadding="5" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">
<tr>
<td width="22%" align="center"><b><i>Expression</i></b></td>
<td width="48%" align="center"><b><i>Expectations</i></b></td>
<td width="108%" align="center"><b><i>Semantics</i></b></td>
</tr>
<tr>
<td width="22%">is_directory(p)</td>
<td width="48%">Returns true if p is found and is a directory, else false.<br>
If true, then directory_iterator(p) would succeed.<br>
If false, then directory_iterator(p) would fail.</td>
<td width="108%">Throws: if <a href="#status">status()</a> &amp; error_flag<br>
Returns: status() &amp; directory_flag</td>
</tr>
<tr>
<td width="22%">is_file(p)</td>
<td width="48%">Returns true if p is found and is not a directory, else
false.<br>
If true, then ifstream(p) would succeed.<br>
False, however, does not imply ifstream(p) would fail (because some
operating systems allow directories to be opened as files, but stat() does
set the &quot;regular file&quot; flag.)</td>
<td width="108%">Throws: if status() &amp; error_flag<br>
Returns: status() &amp; file_flag</td>
</tr>
<tr>
<td width="22%">exists(p) </td>
<td width="48%">Returns is_directory(p) || is_file(p) || is_other(p)</td>
<td width="108%">Throws: if status() &amp; error_flag<br>
Returns: status() &amp;&nbsp;&nbsp; (directory_flag|file_flag|other_flag)</td>
</tr>
<tr>
<td width="22%">is_symlink(p)</td>
<td width="48%">Returns true if p is found by shallow (non-transitive)
search, and is a symbolic link, else false.<br>
If true, and p points to q, then for any filesystem function f except those
specified as working shallowly on symlinks themselves, f(p) calls f(q), and
returns any value returned by f(q).</td>
<td width="108%">Throws: if <a href="#status">symlink_status</a>() &amp;
error_flag<br>
Returns: symlink_status() &amp; symlink_flag</td>
</tr>
<tr>
<td width="22%">!exists(p) &amp;&amp; ((p.has_branch_path() &amp;&amp; exists( p.branch_path())
|| (!p.has_branch_path() &amp;&amp; !p.has_root_path()))<br>
<i>In other words, if the path does not exist, and (the branch does exist,
or (there is no branch and no root)).</i></td>
<td width="48%">If true, create_directory(p) would succeed.<br>
If true, ofstream(p) would succeed.<br>
&nbsp;</td>
<td width="108%">&nbsp;</td>
</tr>
<tr>
<td width="22%">directory_iterator it(p)</td>
<td width="48%">If it != directory_iterator(), assert(exists(*it)||is_symlink(*it)).
Note: exists(*it) may throw, and likewise status(*it) may return error_flag
- there is no guarantee of accessibility.</td>
<td width="108%">&nbsp;</td>
</tr>
</table>
<h3><a name="Conclusion">Conclusion</a></h3>
<p>Predicate operations is_directory(), is_file(), is_symlink(), and exists()
with the indicated semantics form a self-consistent set that meets expectations.</p>
<h2><a name="Preservation">Preservation</a> of existing user code</h2>
<p>Although the change to a template based approach required a complete overhaul
of the implementation code, the interface as used by existing applications is mostly unchanged.
Conversion problems which would
otherwise affect user code have been reduced by providing deprecated
functions to ease transition. The deprecated functions are:</p>
<blockquote>
<pre>// class basic_path - 2nd constructor argument ignored:
basic_path( const string_type &amp; str, name_check );
basic_path( const typename string_type::value_type * s, name_check );
// class basic_path - old names provided for renamed functions:
string_type native_file_string() const;
string_type native_directory_string() const;
// class basic_path - now defined such that these no longer have any real effect:
static bool default_name_check_writable() { return false; }
static void default_name_check( name_check ) {}
static name_check default_name_check() { return 0; }
// non-deducible operations functions assume class path
inline path current_path()
inline const path &amp; initial_path()
// the new basic_directory_entry provides leaf()
// to cover the common existing use case itr-&gt;leaf()
typename Path::string_type leaf() const;</pre>
</blockquote>
<p>If you do not want the deprecated functions to be included, define the macro BOOST_FILESYSTEM_NO_DEPRECATED.</p>
<p>The greatest impact on existing code is the change of directory iterator
value type from <code>path</code> to <code>directory_entry</code>. To ease the
most common directory iterator use case, <code>basic_directory_entry</code>
provides an automatic conversion to <code>basic_path</code>, and this also
serves to prevent breakage of a lot of existing code. See the
<a href="#More_efficient">next section</a> for discussion of rationale.</p>
<blockquote>
<pre>// the new basic_directory_entry provides:
operator const path_type &amp;() const;</pre>
</blockquote>
<h2><a name="More_efficient">More efficient</a> operations when iterating over
directories</h2>
<p>Several common real-world operating systems (BSD derivatives, Linux, Windows)
provide status information during directory iteration. Caching of this status
information results in three to six times faster operation for typical predicate
operations. (For a directory containing 15,047 files, iteration in 1 second vs 6
seconds on a freshly booted system, and 0.3 seconds vs 0.9 seconds after prior use of
the directory.</p>
<p>The efficiency gains from caching such status information were considered too
significant to ignore. Because the possibility of race-conditions differs
depending on whether the cached information is used or an actual system call is
performed, it was considered necessary to provide explicit functions utilizing
the cached information, rather than implicitly using the cache behind the
scenes.</p>
<p>Three options were explored for exposing the cached status information, with
full implementations of each. After initial implementation of option 1 exposed
the problems noted below, option 2 was tested as a possible engineering
tradeoff. Option 3
was finally chosen as the cleanest design.</p>
<table border="1" cellpadding="5" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">
<tr>
<td width="8%" align="center"><b><i>Option</i></b></td>
<td width="25%" align="center"><i><b>How cache accessed</b></i></td>
<td width="94%" align="center"><i><b>Pros and Cons</b></i></td>
</tr>
<tr>
<td width="8%" valign="top" align="center"><i><b>1</b></i></td>
<td width="25%" valign="top">Predicate function overloads<br>
(basic_directory_iterator value_type is path)</td>
<td width="94%">
<ul>
<li>Very Questionable design (friendship abuse, overload abuse, etc)</li>
<li>User cannot reuse cache</li>
<li>Readability problem; easy to miss difference between f(*it) and f(it)</li>
<li>Write-ability problem (error prone?)</li>
<li>Most common iterator use is brief: *it</li>
<li>Preserves existing code</li>
</ul>
</td>
</tr>
<tr>
<td width="8%" valign="top" align="center"><b><i>2</i></b></td>
<td width="25%" valign="top">Predicate member functions of basic_directory_<span style="background-color: #FFFF00">iterator</span><br>
(basic_directory_iterator value_type is path)</td>
<td width="94%">
<ul>
<li>Somewhat cleaner design (although added iterator functions is unusual)</li>
<li>User cannot reuse cache</li>
<li>Readability and write-ability is OK: f(*it) and it.f() sufficiently
different</li>
<li>Most common iterator use is brief: *it</li>
<li>Preserves existing code</li>
</ul>
</td>
</tr>
<tr>
<td width="8%" valign="top" align="center"><b><i>3</i></b></td>
<td width="25%" valign="top">Predicate member functions of basic_directory_<span style="background-color: #FFFF00">entry</span><br>
(basic_directory_iterator value_type is basic_directory_entry)<br>
&nbsp;</td>
<td width="94%">
<ul>
<li>Cleanest design.</li>
<li>User can reuse cache.</li>
<li>Readability and write-ability is OK: f(*it) and it-&gt;f() sufficiently
different.</li>
<li>Most common iterator use is longer: it-&gt;path(), but by providing
&quot;operator const basic_path &amp;&quot; it is still possible to write a bare *it.</li>
<li>Breaks some existing code. The &quot;operator const basic_path &amp;&quot;
conversion eliminates breakage of the most common use case, while
providing a (deprecated) leaf() prevents breakage of the second most
common use case.</li>
</ul>
</td>
</tr>
</table>
<h2><a name="Rationale">Rationale</a></h2>
<h3>Elimination of the native versus generic <a name="distinction">distinction</a></h3>
<p>Elimination of user confusion and general design simplification was the
original motivation for elimination of the distinction between native and
generic paths.</p>
<p>During design work, a further technical argument was discovered. Consider the
path <code>&quot;c:foo/bar&quot;</code>. On many POSIX systems, <code>&quot;c:foo&quot;</code> is a
valid directory name, so we have a two element path and there is no issue of
native versus generic format. On Windows system, however, <code>&quot;c:&quot;</code> is a
drive specification, so we have a three element path. All calls to the operating
system will result in <code>&quot;c:&quot;</code> being considered a drive specification;
there is no way that fact-of-life can be changed by claiming the format is
generic. The native versus generic distinction is thus useless and misleading
for POSIX, Windows, and probably most other operating systems.</p>
<p>If paths for a particular operating system did require a distinction be made,
it could be done by requiring that native paths be prefixed with some unique
implementation-defined identification. For example, <code>&quot;native-path:&quot;</code>.
This would only be required for operating systems where (1) the distinction
mattered, and (2) there was no lexical way to distinguish the two forms. For
example, a native operating system that used the same syntax as the Filesystem
Library's generic POSIX-like format, but processed the elements right-to-left
instead of left-to-right.</p>
<h3>Preservation of <a name="existing-code">existing code</a></h3>
<p>Allowing existing user code to continue to work with the updated version of
the library has obvious benefits in terms of preserving the effort users have
applied to both learning the library and writing code which uses the library.</p>
<p>There is an additional motivation; other than the name checking portion of
class path,&nbsp; the existing interface has proven to be useful and robust, so
there is no reason to fiddle with it.</p>
<h3><a name="Single_path_design">Single path design</a></h3>
<p>During preliminary internationalization discussion on the Boost developer's
list, a design was considered for a single path class which could hold either
narrow or wide character based paths. That design was rejected because:</p>
<ul>
<li>The design was, for many applications, an over-generalization with runtime
memory and speed costs which would have to be paid for even when not needed.<br>
&nbsp;</li>
<li>There was concern that the design would be confusing to users, given that
the standard library already uses single-value-type strings, rather than
strings which morph value types as needed.<br>
&nbsp;</li>
<li>There were technical issues with conversions when a narrow path was
appended to a wide path, and visa versa. The concern was that double
conversions could cause incorrect results, that conversions best left to the
operating system would be performed, and that the technical complexity was too
great in relation to perceived benefits. User-defined types would only make
the problem worse.<br>
&nbsp;</li>
</ul>
<h3>No versions of <a href="reference.html#Status-functions">status()</a> which throw exceptions on
errors</h3>
<p>The rationale for not including versions of status()
which throw exceptions on errors is that (1) the primary purpose of this
function is to perform queries at a very low-level, where exceptions are usually
unwanted, and (2) exceptions on errors are already provided by the predicate
functions. There would be little or no efficiency gain from providing a throwing
version of status().</p>
<h3>Symlink identifying version of <a href="reference.html#Status-functions">status()</a> function</h3>
<p>A symlink identifying version of the status() function is distinguished by a
second argument. Often separately named functions are more appropriate than
overloading when behavior
differs, which is the case here, while overloads are more appropriate when
behavior is the same but argument types differ (Iain Hanson). Overloading was
chosen in this particular case because a subjective judgment that a single
function name with an optional &quot;symlink&quot; second argument produced more
understandable code. The original implementation of the function used the name &quot;symlink_status&quot;,
but that just didn't read right in real code.</p>
<h3>POSIX wpath_traits defaults to locale(&quot;&quot;), but allows imbuing of locale</h3>
<p>Vladimir Prus pointed out that for Linux (and presumably other POSIX
operating systems) that need to convert wide character paths to narrow
characters, the default conversion should not depend on the operating system
alone, but on the std::locale(&quot;&quot;) default. For example, the usual encoding
for Russian on Linux (and Russian web sites) is KOI8-R (RFC1489). The ability to safely specify a different locale
is also provided, to meet unforeseen needs.</p>
<hr>
<p>Revised
<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%d %B, %Y" startspan -->18 March, 2008<!--webbot bot="Timestamp" endspan i-checksum="29005" --></p>
<p>© Copyright Beman Dawes, 2005</p>
<p>Distributed under the Boost Software License, Version 1.0.
(See accompanying file <a href="../../../../LICENSE_1_0.txt">LICENSE_1_0.txt</a> or
copy at <a href="http://www.boost.org/LICENSE_1_0.txt">www.boost.org/LICENSE_1_0.txt</a>)</p>
</body>
</html>