Written by: Neil Brown <neilb@suse.de>

This document describes a prototype for a new approach to providing
overlay-filesystem functionality in Linux (sometimes referred to as
union-filesystems). An overlay-filesystem tries to present a
filesystem which is the result of overlaying one filesystem on top
of another.

The result will inevitably fail to look exactly like a normal
filesystem for various technical reasons. The expectation is that
many use cases will be able to ignore these differences.

This approach is 'hybrid' because the objects that appear in the
filesystem do not all appear to belong to that filesystem. In many
cases, accessing an object in the union will be indistinguishable
from accessing the corresponding object in the original filesystem.
This is most obvious from the 'st_dev' field returned by stat(2).

While directories will report an st_dev from the overlay-filesystem,
all non-directory objects will report an st_dev from the lower or
upper filesystem that is providing the object. Similarly, st_ino will
only be unique when combined with st_dev, and both of these can change
over the lifetime of a non-directory object. Many applications and
tools ignore these values and will not be affected.

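As a rough illustration of why st_ino only means something together
with st_dev, the following Python sketch compares two paths by their
(st_dev, st_ino) pair. The helper names file_identity, same_object and
demo are hypothetical, and the demo runs on an ordinary temporary
directory rather than an overlay mount, so it shows only the comparison
itself and not any overlay behavior.

```python
import os
import tempfile


def file_identity(path):
    """Return the (st_dev, st_ino) pair reported by stat(2) for path."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def same_object(path_a, path_b):
    """True if both paths currently report the same device and inode."""
    return file_identity(path_a) == file_identity(path_b)


def demo():
    # Assumption: a plain temporary directory, not an overlay mount.
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "a")
        second = os.path.join(tmp, "b")
        link = os.path.join(tmp, "a-link")
        for path in (first, second):
            with open(path, "w") as handle:
                handle.write("data")
        os.link(first, link)  # hard link: same device, same inode
        return same_object(first, link), same_object(first, second)
```

A tool that caches such a pair for a non-directory in an overlay should
expect that the pair may stop matching later in the object's lifetime.
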
An overlay filesystem combines two filesystems - an 'upper' filesystem
and a 'lower' filesystem. When a name exists in both filesystems, the
object in the 'upper' filesystem is visible while the object in the
'lower' filesystem is either hidden or, in the case of directories,
merged with the 'upper' object.

It would be more correct to refer to an upper and lower 'directory
tree' rather than 'filesystem' as it is quite possible for both
directory trees to be in the same filesystem and there is no
requirement that the root of a filesystem be given for either upper or
lower.

The lower filesystem can be any filesystem supported by Linux and does
not need to be writable. The lower filesystem can even be another
overlayfs. The upper filesystem will normally be writable and, if it
is, it must support the creation of trusted.* extended attributes, and
must provide valid d_type in readdir responses, at least for symbolic
links - so NFS is not suitable.

A read-only overlay of two read-only filesystems may use any
filesystem type.

Overlaying mainly involves directories. If a given name appears in both
upper and lower filesystems and refers to a non-directory in either,
then the lower object is hidden - the name refers only to the upper
object.

Where both upper and lower objects are directories, a merged directory
is formed.

At mount time, the two directories given as mount options are combined
into a merged directory:

mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper /overlay

Then whenever a lookup is requested in such a merged directory, the
lookup is performed in each actual directory and the combined result
is cached in the dentry belonging to the overlay filesystem. If both
actual lookups find directories, both are stored and a merged
directory is created, otherwise only one is stored: the upper if it
exists, else the lower.

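The lookup rule above can be sketched as a small Python model. This is
not the kernel implementation: directories are modelled as plain dicts
and non-directories as strings, the function name lookup and the sample
trees UPPER and LOWER are hypothetical, and whiteouts and opaque
directories are deliberately left out.

```python
def lookup(upper_dir, lower_dir, name):
    """Model a lookup of `name` in a (possibly merged) directory.

    upper_dir and lower_dir are dicts mapping names to objects, or None
    when that side has no directory here. Returns (kind, upper, lower).
    """
    upper = upper_dir.get(name) if upper_dir is not None else None
    lower = lower_dir.get(name) if lower_dir is not None else None
    if isinstance(upper, dict) and isinstance(lower, dict):
        # Both lookups found directories: store both, result is merged.
        return ("merged", upper, lower)
    if upper is not None:
        # Upper exists: only the upper object is stored.
        return ("upper", upper, None)
    if lower is not None:
        return ("lower", None, lower)
    return ("absent", None, None)


# Hypothetical sample trees: dict = directory, str = file contents.
UPPER = {"etc": {"hosts": "upper hosts"}, "notes": "upper file"}
LOWER = {
    "etc": {"hosts": "lower hosts", "fstab": "lower fstab"},
    "notes": {"old": "hidden by the upper non-directory"},
    "bin": {"sh": "lower sh"},
}
```

Looking up "etc" gives a merged result whose two halves can be passed
back to lookup() for the next path component, while "notes" resolves to
the upper file only because one side is a non-directory.
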
Only the lists of names from directories are merged. Other content
such as metadata and extended attributes is reported for the upper
directory only. These attributes of the lower directory are hidden.

whiteouts and opaque directories
--------------------------------

In order to support rm and rmdir without changing the lower
filesystem, an overlay filesystem needs to record in the upper filesystem
that files have been removed. This is done using whiteouts and opaque
directories (non-directories are always opaque).

The overlay filesystem uses extended attributes with a
"trusted.overlay." prefix to record these details.

A whiteout is created as a symbolic link with target
"(overlay-whiteout)" and with xattr "trusted.overlay.whiteout" set to "y".
When a whiteout is found in the upper level of a merged directory, any
matching name in the lower level is ignored, and the whiteout itself
is also hidden.

A directory is made opaque by setting the xattr "trusted.overlay.opaque"
to "y". Where the upper filesystem contains an opaque directory, any
directory in the lower filesystem with the same name is ignored.

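A minimal Python model of these two rules follows. It is a sketch under
stated assumptions, not the kernel code: the WHITEOUT sentinel stands in
for the whiteout symlink and its xattr, the Dir class and its opaque
flag stand in for a directory and "trusted.overlay.opaque", and the
function name visible is hypothetical. Reporting a whited-out name as
absent is this sketch's reading of the rule above.

```python
WHITEOUT = object()  # hypothetical stand-in for a whiteout in the upper level


class Dir:
    """Model directory; `opaque` mirrors trusted.overlay.opaque == "y"."""

    def __init__(self, entries=None, opaque=False):
        self.entries = dict(entries or {})
        self.opaque = opaque


def visible(upper, lower, name):
    """Return (kind, upper_obj, lower_obj) for `name` in a merged directory."""
    up = upper.entries.get(name) if upper is not None else None
    low = lower.entries.get(name) if lower is not None else None
    if up is WHITEOUT:
        # The lower name is ignored and the whiteout is not shown either.
        return ("absent", None, None)
    if up is None:
        if low is not None:
            return ("lower", None, low)
        return ("absent", None, None)
    if isinstance(up, Dir) and isinstance(low, Dir) and not up.opaque:
        return ("merged", up, low)
    # Non-directories are always opaque; opaque directories ignore lower.
    return ("upper", up, None)
```

For example, with a lower level holding "gone", "cfg/" and "lib/", an
upper level holding a whiteout for "gone", an opaque "cfg/" and a plain
"lib/" yields an absent name, an upper-only directory and a merged
directory respectively.
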
When a 'readdir' request is made on a merged directory, the upper and
lower directories are each read and the name lists merged in the
obvious way (upper is read first, then lower - entries that already
exist are not re-added). This merged name list is cached in the
'struct file' and so remains as long as the file is kept open. If the
directory is opened and read by two processes at the same time, they
will each have separate caches. A seekdir to the start of the
directory (offset 0) followed by a readdir will cause the cache to be
discarded and rebuilt.

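The merge order and the per-open cache can be modelled in a few lines of
Python. This is only a sketch: merge_names and OpenMergedDir are
hypothetical names, the two Python lists stand in for the live upper and
lower directories, and the object plays the role of the cache held in
'struct file'.

```python
def merge_names(upper_names, lower_names):
    """Upper is read first, then lower; names already seen are not re-added."""
    merged = []
    seen = set()
    for name in list(upper_names) + list(lower_names):
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged


class OpenMergedDir:
    """Model of one open merged directory with its own name-list cache."""

    def __init__(self, upper_names, lower_names):
        self._upper = upper_names  # live lists, may change while open
        self._lower = lower_names
        self._cache = None
        self._pos = 0

    def seek(self, offset):
        if offset == 0:
            self._cache = None  # discarded here, rebuilt on the next read
        self._pos = offset

    def read(self):
        """Return the next name, or None at the end of the directory."""
        if self._cache is None:
            self._cache = merge_names(self._upper, self._lower)
        if self._pos >= len(self._cache):
            return None
        name = self._cache[self._pos]
        self._pos += 1
        return name
```

Two OpenMergedDir objects created over the same lists each build their
own cache, which mirrors the separate caches seen by two processes.
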
This means that changes to the merged directory do not appear while a
directory is being read. This is unlikely to be noticed by many
programs.

Seek offsets are assigned sequentially when the directories are read.
Thus, if a program were to:

- read part of a directory
- remember an offset, and close the directory
- re-open the directory some time later
- seek to the remembered offset

there may be little correlation between the old and new locations in
the list of filenames, particularly if anything has changed in the
directory.

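A short worked example makes the effect concrete. The sketch below
assumes that an offset is simply an index into the merged name list
(upper names first, then lower names not already seen); the function
names merged_names and name_at and the sample names are hypothetical.

```python
def merged_names(upper_names, lower_names):
    """Upper names first, then lower names that have not been seen yet."""
    seen = set()
    merged = []
    for name in list(upper_names) + list(lower_names):
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged


def name_at(upper_names, lower_names, offset):
    """Name found at a sequentially assigned offset, or None past the end."""
    names = merged_names(upper_names, lower_names)
    return names[offset] if offset < len(names) else None


# The same remembered offset, before and after "x" is created in upper.
before = name_at(["b"], ["a", "c"], 2)
after = name_at(["b", "x"], ["a", "c"], 2)
```

Here the remembered offset 2 refers to "c" on the first open and to "a"
after one new name has been added to the upper directory.
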
Readdir on directories that are not merged is simply handled by the
underlying directory (upper or lower).

Objects that are not directories (files, symlinks, device-special
files etc.) are presented either from the upper or lower filesystem as
appropriate. When a file in the lower filesystem is accessed in a way
that requires write-access, such as opening for write access, changing
some metadata etc., the file is first copied from the lower filesystem
to the upper filesystem (copy_up). Note that creating a hard-link
also requires copy_up, though of course creation of a symlink does
not.

The copy_up may turn out to be unnecessary, for example if the file is
opened for read-write but the data is not modified.

The copy_up process first makes sure that the containing directory
exists in the upper filesystem - creating it and any parents as
necessary. It then creates the object with the same metadata (owner,
mode, mtime, symlink-target etc.) and then if the object is a file, the
data is copied from the lower to the upper filesystem. Finally any
extended attributes are copied up.

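The steps above can be imitated in user space with a short Python
sketch. It is an analogy only, not what the kernel does: the function
name copy_up and its arguments are hypothetical, only regular files and
symlinks are handled, ownership is not preserved, and shutil.copy2
copies the data before the metadata and copies extended attributes only
where the platform and privileges allow it.

```python
import os
import shutil
import tempfile


def copy_up(lower_root, upper_root, relpath):
    """Copy lower_root/relpath into upper_root, creating parents as needed."""
    src = os.path.join(lower_root, relpath)
    dst = os.path.join(upper_root, relpath)
    if os.path.lexists(dst):
        return dst  # already present in the upper tree

    # Make sure the containing directory and its parents exist in upper,
    # giving each newly created directory the mode and times of its
    # lower counterpart.
    built = ""
    for part in [p for p in os.path.dirname(relpath).split(os.sep) if p]:
        built = os.path.join(built, part)
        upper_dir = os.path.join(upper_root, built)
        if not os.path.isdir(upper_dir):
            os.mkdir(upper_dir)
            shutil.copystat(os.path.join(lower_root, built), upper_dir)

    if os.path.islink(src):
        # Symlink: recreate it with the same target.
        os.symlink(os.readlink(src), dst)
    else:
        # Regular file: data, then mode and times (and xattrs if possible).
        shutil.copy2(src, dst)
    return dst
```

After copy_up() returns, the lower file is unchanged and the upper tree
holds an independent copy at the same relative path.
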
Once the copy_up is complete, the overlay filesystem simply
provides direct access to the newly created file in the upper
filesystem - future operations on the file are barely noticed by the
overlay filesystem (though an operation on the name of the file such as
rename or unlink will of course be noticed and handled).

Non-standard behavior
---------------------

The copy_up operation essentially creates a new, identical file and
moves it over to the old name. The new file may be on a different
filesystem, so both st_dev and st_ino of the file may change.

Any open files referring to this inode will access the old data and
metadata. Similarly any file locks obtained before copy_up will not
apply to the copied up file.

On a file opened with O_RDONLY, fchmod(2), fchown(2), futimesat(2)
and fsetxattr(2) will fail with EROFS.

If a file with multiple hard links is copied up, then this will
"break" the link. Changes will not be propagated to other names
referring to the same inode.

Symlinks in /proc/PID/ and /proc/PID/fd which point to a non-directory
object in overlayfs will not contain valid absolute paths, only
relative paths leading up to the filesystem's root. This will be
fixed in the future.

Some operations are not atomic, for example a crash during copy_up or
rename will leave the filesystem in an inconsistent state. This will
be addressed in the future.

Changes to underlying filesystems
---------------------------------

Offline changes, when the overlay is not mounted, are allowed to either
the upper or the lower trees.

Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed. If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in