The Status of Storage Within Linux
We evaluate LVM, Btrfs, and ZFS from the perspective of a desktop user and look at the pros and cons of each storage technology.
When installing any Linux distribution, an often overlooked and misunderstood component of the configuration process is setting up the storage of the system.
In the early days of Linux, there were only a few options to consider, such as whether to format your partitions as EXT3, XFS, or, if very adventurous, a variant of the ReiserFS lineage. If one wanted some form of software RAID, the only option was Linux mdadm. And setting up system partitions, such as a boot partition, was straightforward, since things like UEFI and the EFI system partition were not yet standard.
Today, we have many more options to consider. With the development of more advanced and all-encompassing storage technologies such as LVM, Btrfs, and ZFS, we are able to create purpose-built storage architectures for our specific use cases. For example, the physical storage composition of a system can vary, with some drives dedicated to capacity and others to speed. Understanding these storage technologies is important for increasing the performance of our systems as a whole. In addition, we can use them to protect ourselves from losing data due to drive failure, a failed system update, or even user error.
In this post, we are going to look at the current state of storage within Linux. We will evaluate LVM, Btrfs, and ZFS from the perspective of a desktop user, and we will look at the pros and cons of the different technologies and how we can use them to our advantage.
Logical Volume Manager (LVM)
LVM offers an additional management layer between the physical drives and the filesystems initialized on the system.
There are a few key concepts within LVM to first understand: physical volumes, volume groups, and logical volumes. A physical volume represents any storage device that is initialized within the context of the LVM subsystem. Volume groups are made up of physical volumes, and represent the total combined capacity of a particular LVM subsystem. Finally, logical volumes utilize the storage pools defined by the volume groups. Logical volumes can be initialized with a particular RAID, mirroring, or striping configuration for data redundancy and performance.
By default, a logical volume reserves all of the space specified at creation from the volume group. We can use thin pools, which are special types of logical volumes, to dynamically grow and reserve space as the system needs it. This allows us to over-allocate space within a Linux storage system.
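To make these concepts concrete, here is a rough sketch of setting up an LVM stack, including a thin pool; the device names and sizes are hypothetical:

```shell
# Initialize two drives as physical volumes (hypothetical devices)
pvcreate /dev/sdb /dev/sdc

# Pool them into a volume group named "vg0"
vgcreate vg0 /dev/sdb /dev/sdc

# A regular logical volume reserves its full size up front
lvcreate -L 100G -n data vg0

# A thin pool lets thin volumes over-allocate space
lvcreate -L 200G -T vg0/thinpool
lvcreate -V 500G -T vg0/thinpool -n thindata

# Format and mount like any block device
mkfs.ext4 /dev/vg0/thindata
```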
Advantages of LVM
LVM can provide significant advantages compared to using only a traditional filesystem.
One of the first advantages is that LVM provides a way to dynamically add or remove storage from a system. To do this, we extend the volume group with a new physical volume. Afterwards, the logical volume can be resized, which is analogous to resizing a partition. Finally, the filesystem can be resized to gain the extra storage at that mount point. This is a very convenient way to expand storage, for example, on a virtual machine that has run out of space.
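The steps above can be sketched as follows; the device and volume names are hypothetical, and the `-r` flag asks lvextend to resize the filesystem in the same step:

```shell
# Add a new drive to the volume group
pvcreate /dev/sdd
vgextend vg0 /dev/sdd

# Grow the logical volume into all free space and
# resize the filesystem at the same time (-r)
lvextend -r -l +100%FREE /dev/vg0/data
```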
Another important feature is snapshots. A snapshot can be created on a live system, then referenced at a later point, written to, or reverted to. Snapshots are a great way to prevent errors if created before upgrading system packages, or even to protect against deleting an important file.
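A minimal sketch of this workflow, with hypothetical names; the snapshot reserves 10 GiB to hold blocks that change on the origin volume:

```shell
# Take a copy-on-write snapshot before a risky upgrade
lvcreate -s -L 10G -n data_snap /dev/vg0/data

# If the upgrade goes wrong, merge the snapshot back into the
# origin (the revert completes on the next volume activation)
lvconvert --merge /dev/vg0/data_snap
```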
Many systems these days have both SSDs for faster storage and traditional spinning HDDs for capacity. With LVM, we can leverage this by setting up caching on logical volumes so that read and write performance increases while the full capacity of the data drives remains available. It is important to note that some writeback cache configurations require that the underlying cache storage be reliable, such as a redundant RAID configuration on backup power, because losing a writeback cache drive can cause filesystem data loss.
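Assuming a hypothetical NVMe device and the volume group from earlier, attaching a cache might look like this; writethrough mode is used in the sketch since it does not risk data loss if the cache device fails:

```shell
# Add the SSD to the volume group
pvcreate /dev/nvme0n1
vgextend vg0 /dev/nvme0n1

# Create a cache volume on the SSD and attach it to the data LV
lvcreate -L 50G -n cache0 vg0 /dev/nvme0n1
lvconvert --type cache --cachevol cache0 --cachemode writethrough vg0/data
```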
LVM has been a part of the Linux ecosystem for a long time. Due to its maturity, LVM is found in many popular distributions by default. As a Red Hat project, LVM plays a large role in CentOS as well as Red Hat Enterprise Linux (RHEL). LVM can be combined with other technologies, such as Red Hat's VDO, to offer features like compression and deduplication. Finally, with LVM, you may still use any of the traditional filesystems, such as XFS and EXT4, which are familiar to the majority of users in the Linux ecosystem.
Disadvantages of LVM
One disadvantage of LVM to be aware of is the issue of bit rot, which is data degradation at the physical drive level. LVM RAID utilizes the traditional mdadm software RAID, which comes with caveats for finding and repairing such degradation. According to the lvmraid(7) man page:
> The repair mode can make the RAID LV data consistent, but it does not know which data is correct. The result may be consistent but incorrect data. When two different blocks of data must be made consistent, it chooses the block from the device that would be used during RAID initialization. However, if the PV holding corrupt data is known, lvchange --rebuild can be used in place of scrubbing to reconstruct the data on the bad device.
To summarize, with the LVM stack and a RAID configuration, we are able to detect inconsistencies in a replicated block of data. Yet the LVM utilities are only able to make a best guess as to which of the blocks is correct to make the system consistent. If the user knows which of the blocks is good, they may manually rebuild the block. We will see in the next sections how Btrfs and ZFS solve this issue using additional filesystem metadata.
The final disadvantages come from the way snapshots work in LVM. Like logical volumes, a snapshot is exposed as a block device. Because of this, transferring an LVM snapshot for backup purposes in an incremental and efficient way can be difficult. There are also several reports of keeping many snapshots on a system causing performance degradation, though I have not found any reputable investigations of this issue, so your results may vary. Maybe this is something we can look into in a future post ;-)
Summary of LVM
| ✅ Pros | ❌ Cons |
| --- | --- |
| Live snapshotting of partitions | No corruption detection |
| Read and write caching to faster storage | |
| Stability of features | |
As we have seen, LVM is a great tool for expanding the functionality of traditional filesystems. It represents an old-school, modular way of configuring a storage stack within Linux, where features are added by creating additional logical abstractions of the storage between the physical disks and the filesystems.
LVM is a great option for use cases where bit rot is not a high-priority issue, or where the underlying storage is highly reliable, as is often the case in public cloud environments such as AWS or Google Cloud.
Btrfs
Btrfs is a newer filesystem developed specifically for the Linux ecosystem. It offers many of the same benefits as LVM, but the features are included in the filesystem itself.
A volume is defined during initialization of the filesystem using `mkfs.btrfs`. We pass in any number of block devices and define the data replication strategies of the volume. Different strategies can be defined for the filesystem metadata and the actual file data of a particular Btrfs volume.
| Profile | Copies | Parity | Striping | Space utilization | Min/max devices |
| --- | --- | --- | --- | --- | --- |
| DUP | 2 / 1 device | - | - | 50% | 1/any |
| RAID0 | 1 | - | 1 to N | 100% | 2/any |
| RAID10 | 2 | - | 1 to N | 50% | 4/any |
| RAID5 | 1 | 1 | 2 to N-1 | (N-1)/N | 2/any |
| RAID6 | 1 | 2 | 3 to N-2 | (N-2)/N | 3/any |
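As a sketch, creating volumes with some of the profiles above might look like this (device paths are hypothetical):

```shell
# Mirror both metadata (-m) and data (-d) across two drives
mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc

# Single-drive layout: duplicated metadata, single data copies
mkfs.btrfs -m dup -d single /dev/sdd
```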
The way Btrfs handles mirrored RAID profiles differs from the traditional definition. Btrfs works on the concept of data copies for mirrored RAID profiles. For example, if a RAID1 volume is created with three equally sized disks, each data block will have one associated copy, and the volume will be able to handle a single disk failure. This is different from a traditional RAID1 configuration, in which all three disks would hold the same copy of the data and the system would be able to handle two disk failures.
After initialization of the volume, a default top-level subvolume is created. A subvolume is a namespaced portion of the volume and can have its own snapshots or subvolumes anywhere within its own mounted directory structure. As a general rule, a subvolume is created to isolate the data within it, and it can have separate mount options and snapshots.
A snapshot is a special type of subvolume. A snapshot is created by targeting an existing subvolume, which creates a copy-on-write reference to the subvolume as it exists at that moment in time. This uses no extra storage and is a very quick operation. The two subvolumes will appear to contain the same data, but the files within each can be updated or deleted without impacting the other. We can then revert to a particular snapshot by mounting its subvolume and removing the original.
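A rough sketch of the snapshot-and-revert flow described above, with hypothetical subvolume names under a volume mounted at /mnt:

```shell
# Create a subvolume and snapshot it in place
btrfs subvolume create /mnt/@home
btrfs subvolume snapshot /mnt/@home /mnt/@home_snap

# To revert, remove the original and snapshot the snapshot back
btrfs subvolume delete /mnt/@home
btrfs subvolume snapshot /mnt/@home_snap /mnt/@home
```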
Advantages of Btrfs
Unlike LVM, Btrfs generates checksums for each data and metadata block of the filesystem. With these checksums, Btrfs is able to detect silent corruption from bit rot and fix it if a valid copy of the block has been replicated to another storage device. This happens automatically on all reads of data as a system is being used, and it should also be run periodically with the scrub command to verify data that is rarely read from the filesystem.
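Running a scrub is a one-liner; the mount point here is hypothetical:

```shell
# Verify all checksums on the volume mounted at /mnt
btrfs scrub start /mnt

# Check progress and any errors found
btrfs scrub status /mnt
```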
Btrfs uses a replication strategy in which chunks are balanced across any number and size of devices in a volume. With traditional RAID, it is recommended to keep device sizes the same. With Btrfs, as long as the number of devices with free space satisfies the data replication strategy of the volume, Btrfs will handle balancing the data across all of the storage devices. Since Btrfs is flexible in this way, converting to a different RAID level, and adding or removing devices, can be done very easily using the `btrfs balance` command.
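For example, growing a single-device volume into a two-device RAID1 might look like this (device path and mount point hypothetical):

```shell
# Add a second device to the mounted volume
btrfs device add /dev/sdc /mnt

# Convert both data and metadata chunks to RAID1
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
```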
A strong advantage of Btrfs is its per-subvolume mount options. Btrfs offers mount options that can enable SSD optimizations and discard support, as well as options for disabling checksums for an entire mount point. Another popular option is compression. When compression is enabled, zlib is the default algorithm, and the filesystem will use heuristics to determine whether a file should be compressed and mark it as such. Existing data can be re-written with a different compression algorithm using `btrfs filesystem defragment` with the `-c` flag.
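As a sketch, enabling zstd compression on a subvolume and re-compressing its existing files (device, subvolume, and mount point are hypothetical):

```shell
# Mount a subvolume with zstd compression enabled
mount -o subvol=@home,compress=zstd /dev/sdb /home

# Re-write existing files with the new algorithm
btrfs filesystem defragment -r -czstd /home
```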
The final advantage Btrfs brings to the table is the ability to convert Ext3, Ext4, and ReiserFS filesystems to Btrfs. The `btrfs-convert` utility converts the filesystem in place, preserving an image of the original filesystem in a subvolume, so the operation can be completely reverted back to the original filesystem if needed.
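The conversion and its rollback are both single commands; the partition here is hypothetical and must be unmounted first:

```shell
# Convert an unmounted ext4 filesystem in place
btrfs-convert /dev/sdb1

# Roll back to the original ext4 filesystem if needed
btrfs-convert -r /dev/sdb1
```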
Disadvantages of Btrfs
Btrfs has a few notable gotchas which a user will want to know about.
The first disadvantage is the stability of the parity RAID5 and RAID6 implementations. Utilizing these leaves a system vulnerable to drive and power failures, in which the filesystem may become corrupted beyond repair. This is known as the "write hole".
> Parity may be inconsistent after a crash (the "write hole"). The problem born when after "an unclean shutdown" a disk failure happens. But these are two distinct failures. These together break the BTRFS raid5 redundancy. If you run a scrub process after "an unclean shutdown" (with no disk failure in between) those data which match their checksum can still be read out while the mismatched data are lost forever.
According to the Btrfs wiki quoted above, a write hole event may happen with RAID5/6, in which an improper shutdown causes a mismatch between the parity and the data. The general advice from the developers, for anyone who wants to utilize a parity RAID implementation, is to use a RAID1-type profile, for example RAID1C3, for the metadata, and then RAID5 or RAID6 for the file data. This way, in case of a write hole event, the filesystem metadata is protected, while the actual filesystem contents can be scrubbed immediately on boot to make the filesystem consistent again.
Another gotcha concerning RAID that could catch a user off guard is the fact that a RAID1 volume may only be mounted read-write once in a degraded state. For example, if a two-disk system experiences a single drive failure, the volume may only be mounted as degraded once until the drive is replaced. Obviously, it is always advised to replace a failed drive as soon as possible, but this is an important limitation to know if you expect your systems to remain functional before redundancy is restored.
Finally, Btrfs does not currently support any type of caching to faster storage. A user is free to mix storage types in a single volume, but the filesystem will not do any type of prioritization of the data onto the faster drives.
Summary of Btrfs
| ✅ Pros | ❌ Cons |
| --- | --- |
| Copy-on-write snapshots | Parity RAID vulnerabilities |
| Corruption protection | Caching not natively supported |
| Conversion between RAID levels and EXT4 | |
As we have seen, Btrfs is a very exciting filesystem that provides many advanced functionalities to the Linux ecosystem. It combines a lot of desirable properties that exist in storage managers like LVM into a flexible filesystem solution.
Btrfs offers a convenient method of creating on-demand snapshots of a system. It also fixes the issue of bit rot by establishing checksums as a means of determining whether a block has been corrupted, and it uses redundancy profiles to fix any corrupted data. Btrfs is flexible in how it represents data on disk and offers convenient methods of converting from one redundancy profile to another, as well as converting from existing filesystems such as EXT4 or ReiserFS.
As Btrfs develops, it will continue to grow and improve the stability of its features. Fedora has already changed the default filesystem of its distribution to Btrfs in Fedora 33. This is an exciting storage technology that will only increase in popularity in the desktop space in the upcoming years.
ZFS
ZFS is an open-source storage technology that came out of Solaris in the early 2000s. After Oracle acquired Sun Microsystems in 2010, development was transferred to a closed-source model. At that time, various forks and ports of the ZFS codebase came to fruition. The development of these ZFS implementations is now under the umbrella of OpenZFS.
ZFS is a complex filesystem with many different concepts which are important to understand before setup and deployment.
The first concept is that of vdevs. In ZFS, a vdev encompasses a set of disks and can have a particular redundancy profile associated with it. A vdev supports mirroring, single-parity (RAIDZ1), double-parity (RAIDZ2), and triple-parity (RAIDZ3) block redundancies.
Another important concept is that of a zpool. A zpool has one or more vdevs underneath it to provide the underlying storage. A zpool does not have any particular redundancy associated with it; that is the responsibility of the vdevs alone. A zpool will distribute writes across the vdevs, mostly according to the free space available to each at the time. It is important to understand that there is no guarantee of how the data will be written, and this should not be confused with true data striping. A zpool is flexible in that vdevs can be dynamically added to the pool, can be of varying sizes, and can use any redundancy technique.
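As a sketch, here are two possible pool layouts built from the vdev types above (the pool name and device paths are hypothetical):

```shell
# A pool of two mirrored vdevs (writes are spread across both)
zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

# Alternatively, a single double-parity RAIDZ2 vdev of four disks
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd
```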
A third concept is that of datasets. A dataset is very similar to a subvolume within Btrfs. It can be used to subdivide the data within the file hierarchy so that a snapshot of the dataset will only contain what is within that dataset. A dataset can also be used to define different filesystem properties, such as compression, encryption, and even deduplication. As an example, here are the datasets defined on one of my desktops:
```
[patrick@summit ~]$ zfs list -o name,mountpoint
NAME                                   MOUNTPOINT
bpool                                  /boot
bpool/sys                              /boot
bpool/sys/BOOT                         none
bpool/sys/BOOT/default                 legacy
rpool                                  /
rpool/sys                              /
rpool/sys/DATA                         none
rpool/sys/DATA/default                 /
rpool/sys/DATA/default/home            /home
rpool/sys/DATA/default/root            /root
rpool/sys/DATA/default/srv             /srv
rpool/sys/DATA/default/usr             /usr
rpool/sys/DATA/default/usr/local       /usr/local
rpool/sys/DATA/default/var             /var
rpool/sys/DATA/default/var/lib         /var/lib
rpool/sys/DATA/default/var/lib/docker  /var/lib/docker
rpool/sys/DATA/default/var/log         /var/log
rpool/sys/DATA/default/var/spool       /var/spool
rpool/sys/DATA/default/var/tmp         /var/tmp
rpool/sys/ROOT                         none
rpool/sys/ROOT/default                 /
```
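Datasets like those above are created with `zfs create`; properties can be set at creation time and are inherited by child datasets (the names here are hypothetical):

```shell
# A parent dataset mounted at /home
zfs create -o mountpoint=/home rpool/home

# A child dataset that overrides the compression property
zfs create -o compression=zstd rpool/home/patrick
```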
The final concept is that of zvols. A zvol exists at the same level as a dataset but is used to expose a raw block device to the system. This can be helpful if you want, for example, a separate swap device on top of ZFS.
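A sketch of the swap use case (the pool name and size are hypothetical):

```shell
# Expose a 16 GiB raw block device from the pool
zfs create -V 16G rpool/swap

# Use it like any other block device
mkswap /dev/zvol/rpool/swap
swapon /dev/zvol/rpool/swap
```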
Advantages of ZFS
ZFS is quite an advanced filesystem and supports more features than just about any other filesystem to date. Like Btrfs, ZFS is based on copy-on-write. The greatest benefit of this is that it allows instantaneous live snapshots to be created from any dataset. Unlike Btrfs, ZFS takes the concept a step further and applies copy-on-write at the disk management level as well.
> Copy-on-write in ZFS isn't only at the filesystem level, it's also at the disk management level. This means that the RAID hole—a condition in which a stripe is only partially written before the system crashes, making the array inconsistent and corrupt after a restart—doesn't affect ZFS. Stripe writes are atomic, the vdev is always consistent, and Bob's your uncle.
According to the Ars Technica article quoted above, entitled ZFS 101—Understanding ZFS storage and performance, the way ZFS commits data to disk using copy-on-write means the filesystem is not affected by the write hole problem, unlike Btrfs. In fact, the parity implementations in ZFS, denoted RAIDZ, are highly lauded by the community and are considered very stable.
An additional advantage ZFS brings to a system is an advanced caching hierarchy for both read and write operations. Traditionally, filesystems let the kernel page cache handle caching of recently used blocks and file metadata. ZFS takes another route and maintains its own in-memory cache using more advanced algorithms. This read cache is called the Adaptive Replacement Cache (ARC) and is the main reason that ZFS is known to like a lot of RAM. One can also configure a high-performance L2ARC device, which extends this read cache by holding blocks evicted from the ARC.
For speeding up synchronous write operations, ZFS supports a special device known as a Separate Log (SLOG) device. A user may register a high-endurance, write-optimized persistent storage device to hold the ZFS Intent Log (ZIL). The ZIL is simply a journal of write transactions to the ZFS pool. It may be referenced after a system crash and is used to keep the filesystem consistent and the data within it reliable. It is important to note that this cache only helps synchronous write operations, which are used heavily by databases and virtual machine disks.
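Cache and log devices are attached to an existing pool; the pool name and NVMe paths here are hypothetical:

```shell
# Attach an NVMe read cache (L2ARC) to the pool
zpool add tank cache /dev/nvme0n1

# Attach a mirrored SLOG to hold the ZIL
zpool add tank log mirror /dev/nvme1n1 /dev/nvme2n1
```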
Finally, ZFS offers many features which may optionally be enabled per dataset. Compression, deduplication, and encryption can all be enabled on each dataset. Like Btrfs, ZFS checksums all filesystem data in order to protect against bit rot.
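These properties are toggled per dataset with `zfs set`; note that encryption must be chosen when a dataset is created (the names here are hypothetical):

```shell
# Enable compression and deduplication on existing datasets
zfs set compression=zstd tank/data
zfs set dedup=on tank/vms

# Encryption can only be enabled at creation time
zfs create -o encryption=on -o keyformat=passphrase tank/secrets
```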
Disadvantages of ZFS
ZFS is without a doubt a tremendous filesystem with many features and a stable implementation. That being said, there are a few disadvantages a user should be aware of.
The first is that ZFS is under the open-source CDDL license. Unfortunately, the CDDL is considered incompatible with the GPL, which means ZFS cannot legally be included with the Linux kernel. Instead, it is advised that users load the required kernel code on boot via Dynamic Kernel Module Support (DKMS) packages. This presents the following problems:
- The ZFS kernel modules must be included in the initramfs.
- The ZFS kernel modules must be independently rebuilt for each new kernel release.
The first issue only presents itself if the user compiles their own kernels, in which case they need to always remember to include the ZFS DKMS packages with their custom kernels, especially if booting from ZFS or having root mounted on ZFS. The second is a problem that many users of rolling distributions, for example Arch Linux, can face. When a new Linux kernel is released, there may be a period of time when a system cannot be updated due to conflicts, because the independently maintained ZFS DKMS packages lag behind. It is an unfortunate reality caused by a simple license incompatibility.
The final disadvantage has to do with expanding the storage of a zpool in RAIDZ configurations. ZFS is not as flexible about adding and removing storage as Btrfs or LVM. As detailed in the article entitled "The 'Hidden' Cost of Using ZFS for Your Home NAS", a vdev cannot easily be changed after initialization of a parity RAID configuration. This means that a disk cannot be added later down the road when you are ready to expand storage; one cannot simply add a disk and rebalance as with Btrfs. You will need to create an additional vdev, ideally of the same RAIDZ profile type, and add that vdev to the zpool. The other option is replacing each drive, one at a time, with a larger drive, and expanding the vdev that way. But this is wholly inefficient in time, as a resilvering is required for each drive replacement.
Summary of ZFS
| ✅ Pros | ❌ Cons |
| --- | --- |
| Copy-on-write snapshots | License incompatibilities with GPL |
| Corruption protection | Least flexible storage expansion |
| Read and write caching | |
| Stability of features | |
ZFS is an amazing Swiss Army knife for storage. It handles all storage concerns well and offers many of the features one would want in a filesystem. Its feature set has been the basis of comparison for newer filesystems like Btrfs, and it will continue to be used and have a strong following among storage enthusiasts.
ZFS is a great fit for NAS devices where data integrity is critical, and it excels with large storage pools using caching devices to boost read and write performance. It is also highly regarded as local persistent storage for cloud platforms such as Proxmox, where it handles on-demand snapshots and offers high-performance storage configurations.
It is certainly an exciting time, with such storage technologies accessible to the end user. With projects like LVM, Btrfs, and ZFS in active use and development, it is important to understand these technologies, to see what their use cases are in relation to one another, to learn from the concepts they bring to the Linux ecosystem, and to discover how we can set up our own systems for better reliability, performance, and usability.
Thanks for reading. I really appreciate feedback on what you think about these conclusions, and what your own use cases, both at work and in your own home lab, have been!