- Start Date: 2014-11-12
- RFC PR: rust-lang/rfcs#474
- Rust Issue: rust-lang/rust#20034
Summary
This RFC reforms the design of the std::path module in preparation for API stabilization. The path API must deal with many competing demands, and the current design handles many of them, but suffers from some significant problems given in “Motivation” below. The RFC proposes a redesign modeled loosely on the current API that addresses these problems while maintaining the advantages of the current design.
Motivation
The design of a path abstraction is surprisingly hard. Paths work radically differently on different platforms, so providing a cross-platform abstraction is challenging. On some platforms, paths are not required to be in Unicode, posing ergonomic and semantic difficulties for a Rust API. These difficulties are compounded if one also tries to provide efficient path manipulation that does not, for example, require extraneous copying. And, of course, the API should be easy and pleasant to use.
The current std::path module makes a strong effort to balance these design constraints, but over time a few key shortcomings have emerged.
Semantic problems
Most importantly, the current std::path module makes some semantic assumptions about paths that have turned out to be incorrect.
Normalization
Paths in std::path are always normalized, meaning that a/../b is treated like b (among other things). Unfortunately, this kind of normalization changes the meaning of paths when symbolic links are present: if a is a symbolic link, then the relative paths a/../b and b may refer to completely different locations. See this issue for more detail.
For this reason, most path libraries do not perform full normalization of paths, though they may normalize paths like a/./b to a/b. Instead, they offer (1) methods to optionally normalize and (2) methods to normalize based on the contents of the underlying file system.
Since our current normalization scheme can silently and incorrectly alter the meaning of paths, it needs to be changed.
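To make the hazard concrete, here is a minimal sketch of purely lexical normalization written against today's std::path (which descends from this design but differs in naming). The helper lexically_normalize is hypothetical and is not part of this RFC; it exists only to show the kind of rewriting that is unsafe when symbolic links are involved.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical helper (not proposed by the RFC): normalizes a path using
/// only its text, collapsing `name/..` pairs without asking the file system.
fn lexically_normalize(path: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for component in path.components() {
        match component {
            // Dropping `.` never changes what a path refers to.
            Component::CurDir => {}
            // Dropping the previous name is only correct when that name is
            // not a symbolic link, which cannot be known lexically.
            Component::ParentDir => {
                let last_is_name =
                    matches!(out.components().next_back(), Some(Component::Normal(_)));
                if last_is_name {
                    out.pop();
                } else {
                    out.push("..");
                }
            }
            // Prefixes, the root, and ordinary names are kept as-is.
            other => out.push(other.as_os_str()),
        }
    }
    out
}
```

With this helper, a/../b becomes b. If a is a symbolic link to /x/y, the original path names /x/b while the rewritten one names b in the current directory, which is exactly the silent change of meaning described above.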
Unicode and Windows
In the original std::path design, it was assumed that all paths on Windows were Unicode. However, it turns out that the Windows filesystem APIs actually work with UCS-2, which roughly means that they accept arbitrary sequences of u16 values but interpret them as UTF-16 when it is valid to do so.
The current std::path implementation is built around the assumption that Windows paths can be represented as Rust string slices, and will need to be substantially revised.
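The following sketch illustrates why a string slice cannot hold every Windows path. It uses only the standard String::from_utf16 and String::from_utf16_lossy functions; the helper name decode_wide is an assumption made for this example.

```rust
/// Hypothetical helper: tries to view a sequence of u16 code units (as
/// accepted by the Windows file system APIs) as a Rust string.
/// Returns None when the units are not valid UTF-16.
fn decode_wide(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Hypothetical helper: lossy variant that substitutes U+FFFD for any
/// unpaired surrogate, and therefore cannot round-trip such names.
fn decode_wide_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

A sequence such as [0x0061, 0xD800, 0x0062] contains an unpaired surrogate, so decode_wide returns None for it, even though it is an acceptable sequence of u16 values in the sense described above.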
Ergonomic problems
Because paths in general are not in Unicode, the std::path module cannot rely on an internal string or string slice representation. That in turn causes trouble for methods like dirname that are intended to extract a subcomponent of a path -- what should it return?
There are basically three possible options, and today’s std::path module chooses all of them:
- Yield a byte sequence: dirname yields an &[u8]
- Yield a string slice, accounting for potential non-UTF-8 values: dirname_str yields an Option<&str>
- Yield another path: dir_path yields a Path
This redundancy is present for most of the decomposition methods. The saving grace is that, in general, path methods consume BytesContainer values, so one can use the &[u8] variant but continue to work with other path methods. But in general &[u8] values are not ergonomic to work with, and the explosion in methods makes the module more (superficially) complex than one might expect.
You might be tempted to provide only the third option, but Path values are owned and mutable, so that would imply cloning on every decomposition operation. For applications like Cargo that work heavily with paths, this would be an unfortunate (and seemingly unnecessary) overhead.
Organizational problems
Finally, the std::path module presents a somewhat complex API organization:
- The Path type is a direct alias of a platform-specific path type.
- The GenericPath trait provides most of the common API expected on both platforms.
- The GenericPathUnsafe trait provides a few unsafe/unchecked functions for performance reasons.
- The posix and windows submodules provide their own Path types and a handful of platform-specific functionality (in particular, windows provides support for working with volumes and “verbatim” paths prefixed with \\?\).
This organization needs to be updated to match current conventions and simplified if possible.
One thing to note: with the current organization, it is possible to work with non-native paths, which can sometimes be useful for interoperation. The new design should retain this functionality.
Detailed design
Note: this design is influenced by the Boost filesystem library and Scheme48 and Racket’s approach to encoding issues on Windows.
Overview
The basic design uses DST to follow the same pattern as Vec<T>/[T] and String/str: there is a PathBuf type for owned, mutable paths and an unsized Path type for slices. The various “decomposition” methods for extracting components of a path all return slices, and PathBuf itself derefs to Path.
The result is an API that is both efficient and ergonomic: there is no need to allocate/copy when decomposing a path, but there is also no need to provide multiple variants of methods to extract bytes versus Unicode strings. For example, the Path slice type provides a single method for converting to a str slice (when applicable).
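As an illustration of the owned/borrowed split, the sketch below uses today's std::path, which adopted the PathBuf/Path pairing. Note that it is not the exact API proposed here: for instance, the shipped file_stem yields an OsStr slice rather than a Path slice. The function names stem_of and build are made up for this example.

```rust
use std::path::{Path, PathBuf};

/// Accepts a borrowed path slice. A `&PathBuf` can be passed directly
/// because `PathBuf` derefs to `Path`. The result borrows from the input,
/// so no allocation or copy happens during decomposition.
fn stem_of(path: &Path) -> Option<&str> {
    path.file_stem().and_then(|stem| stem.to_str())
}

/// Builds an owned, mutable path, analogous to growing a `String`.
fn build() -> PathBuf {
    let mut buf = PathBuf::from("src");
    buf.push("main.rs");
    buf
}
```

Calling stem_of(&build()) yields the stem main as a borrowed string slice, and build().parent() yields a borrowed slice for src.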
A key aspect of the design is that there is no internal normalization of paths at all. Aside from solving the symbolic link problem, this choice also has useful ramifications for the rest of the API, described below.
The proposed API deals with the other problems mentioned above, and also brings the module in line with current Rust patterns and conventions. These details will be discussed after getting a first look at the core API.
The cross-platform API
The proposed core, cross-platform API provided by the new std::path is as follows:
    // A sized, owned type akin to String:
    pub struct PathBuf { .. }

    // An unsized slice type akin to str:
    pub struct Path { .. }

    // Some ergonomics and generics, following the pattern in String/str and Vec<T>/[T]
    impl Deref<Path> for PathBuf { ... }
    impl BorrowFrom<PathBuf> for Path { ... }

    // A replacement for BytesContainer; used to cut down on explicit coercions
    pub trait AsPath for Sized? {
        fn as_path(&self) -> &Path;
    }

    impl<Sized? P> PathBuf where P: AsPath {
        pub fn new<T: IntoString>(path: T) -> PathBuf;

        pub fn push(&mut self, path: &P);
        pub fn pop(&mut self) -> bool;

        pub fn set_file_name(&mut self, file_name: &P);
        pub fn set_extension(&mut self, extension: &P);
    }

    // These will ultimately replace the need for `push_many`
    impl<Sized? P> FromIterator<P> for PathBuf where P: AsPath { .. }
    impl<Sized? P> Extend<P> for PathBuf where P: AsPath { .. }

    impl<Sized? P> Path where P: AsPath {
        pub fn new(path: &str) -> &Path;

        pub fn as_str(&self) -> Option<&str>;
        pub fn to_str_lossy(&self) -> Cow<String, str>; // Cow will replace MaybeOwned
        pub fn to_owned(&self) -> PathBuf;

        // iterate over the components of a path
        pub fn iter(&self) -> Iter;

        pub fn is_absolute(&self) -> bool;
        pub fn is_relative(&self) -> bool;
        pub fn is_ancestor_of(&self, other: &P) -> bool;

        pub fn path_relative_from(&self, base: &P) -> Option<PathBuf>;
        pub fn starts_with(&self, base: &P) -> bool;
        pub fn ends_with(&self, child: &P) -> bool;

        // The "root" part of the path, if absolute
        pub fn root_path(&self) -> Option<&Path>;

        // The "non-root" part of the path
        pub fn relative_path(&self) -> &Path;

        // The "directory" portion of the path
        pub fn dir_path(&self) -> &Path;

        pub fn file_name(&self) -> Option<&Path>;
        pub fn file_stem(&self) -> Option<&Path>;
        pub fn extension(&self) -> Option<&Path>;

        pub fn join(&self, path: &P) -> PathBuf;

        pub fn with_file_name(&self, file_name: &P) -> PathBuf;
        pub fn with_extension(&self, extension: &P) -> PathBuf;
    }

    pub struct Iter<'a> { .. }

    impl<'a> Iterator<&'a Path> for Iter<'a> { .. }

    pub const SEP: char = ..
    pub const ALT_SEPS: &'static [char] = ..

    pub fn is_separator(c: char) -> bool { .. }
There is plenty of overlap with today’s API, and the methods being retained here largely have the same semantics.
But there are also a few potentially surprising aspects of this design that merit comment:
- Why does PathBuf::new take IntoString? It needs an owned buffer internally, and taking a string means that Unicode input is guaranteed, which works on all platforms. (In general, the assumption is that non-Unicode paths are most commonly produced by reading a path from the filesystem, rather than creating new ones. As we’ll see below, there are platform-specific ways to create non-Unicode paths.)
- Why no Path::as_bytes method? There is no cross-platform way to expose paths directly in terms of byte sequences, because each platform extends beyond Unicode in its own way. In particular, Unix platforms accept arbitrary u8 sequences, while Windows accepts arbitrary u16 sequences (both modulo disallowing interior 0s). The u16 sequences provided by Windows do not have a canonical encoding as bytes; this RFC proposes to use WTF-8 (see below), but does not reveal that choice.
- What about interior nulls? Currently various Rust system APIs will panic when given strings containing interior null values because, while these are valid UTF-8, it is not possible to send them as-is to C APIs that expect null-terminated strings. The API here follows the same approach, panicking if given a path with an interior null.
- Why do file_name and extension operations work with Path rather than some other type? In particular, it may seem strange to view an extension as a path. But doing so allows us to not reveal platform differences about the various character sets used in paths. By and large, extensions in practice will be valid Unicode, so the various methods going to and from str will suffice. But as with paths in general, there are platform-specific ways of working with non-Unicode data, explained below.
- Where did push_many and friends go? They’re replaced by implementing FromIterator and Extend, following a similar pattern with the Vec type. (Some work will be needed to retain full efficiency when doing so.)
- How does Path::new work? The ability to directly get a &Path from an &str (i.e., with no allocation or other work) is a key part of the representation choices, which are described below.
- Where is the normalize method? Since the path type no longer internally normalizes, it may be useful to explicitly request normalization. This can be done by writing let normalized: PathBuf = p.iter().collect() for a path p, because the iterator performs some on-the-fly normalization (see below). NOTE: this normalization does not include removing .., for the reasons explained at the beginning of the RFC.
- What does the iterator yield? Unlike today’s components, the iter method here will begin with root_path if there is one. Thus, a/b/c will yield a, b and c, while /a/b/c will yield /, a, b and c.
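The FromIterator/Extend replacement for push_many and the root-first iteration order can be sketched with today's std::path, where the iterator is called components rather than iter. The helper names below are assumptions made for this example, and the component strings shown assume a Unix-style separator.

```rust
use std::path::{Path, PathBuf};

/// Builds a path from pieces via `FromIterator`, the replacement for
/// `push_many` described above.
fn from_parts(parts: &[&str]) -> PathBuf {
    parts.iter().collect()
}

/// Appends several pieces to an existing path via `Extend`.
fn append_parts(base: &Path, parts: &[&str]) -> PathBuf {
    let mut buf = base.to_path_buf();
    buf.extend(parts);
    buf
}

/// Renders each component as text so the iteration order is visible.
/// For an absolute path the root comes first.
fn component_names(path: &Path) -> Vec<String> {
    path.components()
        .map(|c| c.as_os_str().to_string_lossy().into_owned())
        .collect()
}
```

On a Unix platform, component_names(Path::new("/a/b/c")) produces /, a, b, c, and collecting the components of a path back into a PathBuf gives a path that compares equal to the original.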
Important semantic rules
The path API is designed to satisfy several semantic rules described below. Note that == here is lazily normalizing, treating ./b as b and a//b as a/b; see the next section for more details.
Suppose p is some &Path and dot == Path::new("."):
    p == p.join(dot)
    p == dot.join(p)

    p == p.root_path().unwrap_or(dot)
          .join(p.relative_path())

    p.relative_path() == match p.root_path() {
        None => p,
        Some(root) => p.path_relative_from(root).unwrap()
    }

    p == p.dir_path()
          .join(p.file_name().unwrap_or(dot))

    p == p.iter().collect()

    p == match p.file_name() {
        None => p,
        Some(name) => p.with_file_name(name)
    }

    p == match p.extension() {
        None => p,
        Some(ext) => p.with_extension(ext)
    }

    p == match (p.file_stem(), p.extension()) {
        (Some(stem), Some(ext)) => p.with_file_name(stem).with_extension(ext),
        _ => p
    }
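Several of these rules can be checked mechanically. The sketch below does so against today's std::path rather than the API proposed here, so it uses parent and components in place of dir_path and iter, and it covers only the subset of rules that the shipped library is known to satisfy. The function round_trips is a hypothetical name for this example.

```rust
use std::path::{Path, PathBuf};

/// Returns true when a subset of the round-trip rules holds for `p`.
/// Equality is component-based, so `a/b/.` compares equal to `a/b`.
fn round_trips(p: &Path) -> bool {
    let same = |q: PathBuf| q.as_path() == p;

    // p == p.join(dot)
    let joined_dot = p.join(Path::new("."));

    // p == directory portion joined with the file name
    let rejoined = match (p.parent(), p.file_name()) {
        (Some(dir), Some(name)) => dir.join(name),
        _ => p.to_path_buf(),
    };

    // p == p with its own file name set again
    let renamed = match p.file_name() {
        Some(name) => p.with_file_name(name),
        None => p.to_path_buf(),
    };

    // p == p with its own extension set again
    let reextended = match p.extension() {
        Some(ext) => p.with_extension(ext),
        None => p.to_path_buf(),
    };

    // p == the path rebuilt from its components
    let collected: PathBuf = p.components().collect();

    same(joined_dot) && same(rejoined) && same(renamed) && same(reextended) && same(collected)
}
```

A property-testing harness could feed arbitrary paths to a function like this; the fixed inputs a/b.txt, b.txt and /usr/lib/libc.so all satisfy it.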
Representation choices, Unicode, and normalization
A lot of the design in this RFC depends on a key property: both Unix and Windows paths can be easily represented as a flat byte sequence “compatible” with UTF-8. For Unix platforms, this is trivial: they accept any byte sequence, and will generally interpret the byte sequences as UTF-8 when valid to do so. For Windows, this representation involves a clever hack – proposed formally as WTF-8 – that encodes its native UCS-2 in a generalization of UTF-8. This RFC will not go into the details of that hack; please read Simon’s excellent writeup if you’re interested.
The upshot of all of this is that we can uniformly represent path slices as newtyped byte slices, and any UTF-8 encoded data will “do the right thing” on all platforms.
Furthermore, by not doing any internal, up-front normalization, it’s possible to provide a Path::new that goes from &str to &Path with no intermediate allocation or validation. In the common case that you’re working with Rust strings to construct paths, there is zero overhead. It also means that Path::new(some_str).as_str() == Some(some_str).
The main downside of this choice is that some of the path functionality must cope with non-normalized paths. So, for example, the iterator must skip . path components (unless it is the entire path), and similarly for methods like pop. In general, methods that yield new path slices are expected to work as if:
- ./b is just b
- a//b is just a/b
and comparisons between paths should also behave as if the paths had been normalized in this way.
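The combination of an untouched representation with normalizing comparisons can be observed in today's std::path, as sketched below. The helper names are assumptions for this example. One caveat: the shipped library keeps a leading . as its own component, so it does not treat ./b as equal to b as proposed here; the sketch therefore uses a/./b for the positive case.

```rust
use std::path::Path;

/// Compares two paths given as strings. Comparison works on components,
/// so repeated separators and interior `.` components are ignored.
fn same_path(a: &str, b: &str) -> bool {
    Path::new(a) == Path::new(b)
}

/// Converting a string to a path slice and back returns the original text
/// unchanged, because nothing is rewritten up front.
fn round_trip_str(s: &str) -> Option<&str> {
    Path::new(s).to_str()
}
```

Here same_path("a//b", "a/b") holds while round_trip_str("a//b") still returns the original a//b, and same_path("a/../b", "b") does not hold because .. is never removed.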
Organization and platform-specific APIs
Finally, the proposed API is organized as std::path with unix and windows submodules, as today. However, there is no GenericPath or GenericPathUnsafe; instead, the API given above is implemented as a trivial wrapper around path implementations provided by either the unix or the windows submodule (based on #[cfg]). In other words:
- std::path::windows::Path works with Windows-style paths
- std::path::unix::Path works with Unix-style paths
- std::path::Path is a thin newtype wrapper around the current platform’s path implementation
This organization makes it possible to manipulate foreign paths by working with the appropriate submodule.
In addition, each submodule defines some extension traits, explained below, that supplement the path API with functionality relevant to its variant of path.
But what if you’re writing a platform-specific application and wish to use the extended functionality directly on std::path::Path? In this case, you will be able to import the appropriate extension trait via os::unix or os::windows, depending on your platform. This is part of a new, general strategy for explicitly “opting-in” to platform-specific features by importing from os::some_platform (where the some_platform submodule is available only on that platform).
Unix
On Unix platforms, the only additional functionality is to let you work directly with the underlying byte representation of various path types:
    pub trait UnixPathBufExt {
        fn from_vec(path: Vec<u8>) -> Self;
        fn into_vec(self) -> Vec<u8>;
    }

    pub trait UnixPathExt {
        fn from_bytes(path: &[u8]) -> &Self;
        fn as_bytes(&self) -> &[u8];
    }
This is acceptable because the platform supports arbitrary byte sequences (usually interpreted as UTF-8).
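For comparison, today's standard library exposes this opt-in byte access through the std::os::unix::ffi::OsStrExt extension trait on OsStr rather than through traits named UnixPathExt. The sketch below uses that shipped trait; the wrapper functions path_from_bytes and path_bytes are hypothetical, and the code compiles only on Unix platforms.

```rust
/// Views raw bytes as a path slice without copying or validating UTF-8.
#[cfg(unix)]
fn path_from_bytes(bytes: &[u8]) -> &std::path::Path {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt; // the platform opt-in import
    std::path::Path::new(OsStr::from_bytes(bytes))
}

/// Returns the underlying bytes of a path slice.
#[cfg(unix)]
fn path_bytes(path: &std::path::Path) -> &[u8] {
    use std::os::unix::ffi::OsStrExt;
    path.as_os_str().as_bytes()
}
```

A name such as the bytes caf\xE9.txt (Latin-1, not valid UTF-8) can be represented this way: converting it to a string slice fails, yet the extension can still be extracted and the original bytes are recovered unchanged.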
Windows
On Windows, the additional APIs allow you to convert to/from UCS-2 (roughly, arbitrary u16 sequences interpreted as UTF-16 when applicable); because the name “UCS-2” does not have a clear meaning, these APIs use u16_slice and will be carefully documented. They also provide the remaining Windows-specific path decomposition functionality that today’s path module supports.
    pub trait WindowsPathBufExt {
        fn from_u16_slice(path: &[u16]) -> Self;
        fn make_non_verbatim(&mut self) -> bool;
    }

    pub trait WindowsPathExt {
        fn is_cwd_relative(&self) -> bool;
        fn is_vol_relative(&self) -> bool;
        fn is_verbatim(&self) -> bool;
        fn prefix(&self) -> PathPrefix;
        fn to_u16_slice(&self) -> Vec<u16>;
    }

    enum PathPrefix<'a> {
        Verbatim(&'a Path),
        VerbatimUNC(&'a Path, &'a Path),
        VerbatimDisk(&'a Path),
        DeviceNS(&'a Path),
        UNC(&'a Path, &'a Path),
        Disk(&'a Path),
    }
Drawbacks
The DST/slice approach is conceptually more complex than today’s API, but in practice seems to yield a much tighter API surface.
Alternatives
Due to the known semantic problems, it is not really an option to retain the current path implementation. As explained above, supporting UCS-2 also means that the various byte-slice methods in the current API are untenable, so the API also needs to change.
Probably the main alternative to the proposed API would be to not use DST/slices, and instead use owned paths everywhere (probably doing some normalization of . at the same time). While the resulting API would be simpler in some respects, it would also be substantially less efficient for common operations.
Unresolved questions
It is not clear how best to incorporate the WTF-8 implementation (or how much to incorporate) into libstd.
There has been a long debate over whether paths should implement Show given that they may contain non-UTF-8 data. This RFC does not take a stance on that (the API may include something like today’s display adapter), but a follow-up RFC will address the question more generally.