
Monday, May 24, 2010

Tried ZFS on Linux?



Jeff Bonwick, the leader of the team at Sun Microsystems that developed ZFS, called it “…the last word in filesystems.” It is indeed worthy of the praise, considering its advanced yet easily maintainable features. ZFS, a pseudo-acronym for what was earlier called the Zettabyte File System, is a 128-bit filesystem, as opposed to presently available 64-bit filesystems like ext4. Some of its excellent features include:
  • Simplified administration: ZFS has a well-planned hierarchical structure with the uberblock (the parent of all blocks) and the disk label at the top, followed by pool-wide metadata, the filesystem's metadata, directories and files. The uberblock checksum is used as the digital signature for the entire filesystem. Besides property inheritance (utilising the hierarchical structure), ZFS provides automatic management of mounting, sharing, compression, ACLs, quotas, reservations, etc, making administration easier and more effective. The filesystems in ZFS can be compared to directories in ordinary filesystems like ext3, and most administration tasks are done using just two commands: zfs and zpool.
  • Pooled storage: ZFS has revolutionised the filesystem implementation and its management with the introduction of storage pools. Concepts like datasets (a generic term for volumes, filesystems, snapshots and clones) and pools (a large storage area available for the datasets) make filesystem handling easier for the administrator. Like the virtual memory model for a process, the filesystem can grow its usage space as required without any pre-determined space limits unless provided as ‘quotas’ within the pool model. ‘Quotas’ can be set, changed or removed at will. Also, a minimum ‘reservation’ space for each filesystem can be specified. One important aspect of the storage pool is the removal of volume management architecture, thus reducing a lot of complexity for the administrator.
  • Transactional paradigm: ZFS, being a transactional filesystem, is guaranteed by its developers to always be consistent on disk. Data management in ZFS uses copy-on-write semantics, which ensure that live data is never overwritten in place and an old reference to the data is always maintained. A sequence of filesystem operations is either committed or ignored as a whole, thereby preventing any corruption of the filesystem due to a power failure or some other outage. This, in effect, removes the need for fsck, the traditional filesystem check and repair tool.
  • Scrubbing and self-healing: Since data and even metadata are checksummed, data scrubbing (an operation that checks data integrity within a filesystem, in other words, data validation) is performed easily within ZFS. The checksum algorithm is user-selectable, ranging from fletcher2 to SHA-256, and checksums are stored in a 256-bit field. Besides checking for data integrity and preventing silent corruption, ZFS also provides mechanisms for self-healing, mainly through RAID-Z and mirroring. The two RAID-Z variations, single- and double-parity, are in fact slight variations of RAID-5 and RAID-6, respectively, designed mainly to eliminate the write hole and solidify data integrity. Besides, techniques like resilvering (resyncing) help in replacing a corrupted or faulty device with a new one.
  • Scalability: The team behind ZFS made the decision to go for a 128-bit filesystem, even though 64-bit filesystems like ext4 have come up only recently. Its data limit is an enormous 256 quadrillion zettabytes of storage, which is an almost impossible limit to reach in the near future, since fully populating a 128-bit storage pool would literally require more energy than boiling the oceans, as Bonwick pointed out. Directories can have up to 2^48 (256 trillion) entries, and no limit exists on the number of filesystems or the number of files that can be contained within a filesystem.
  • Snapshots and clones: A snapshot is a read-only copy of a filesystem or volume at a particular point in time. It is designed so that space is consumed only when data changes, and the data a snapshot references is never freed from the filesystem unless explicitly requested, giving further options for maintaining data integrity. A clone is a writable filesystem generated from a snapshot. Creating snapshots and clones in ZFS is very simple and is always pointed out as one of its big advantages, as the short sketch below shows.
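For instance, a minimal sketch, assuming a pool named 'tank' with an existing filesystem 'tank/home' (both names are just examples), would be:
zfs snapshot tank/home@before_upgrade
zfs clone tank/home@before_upgrade tank/home_test
The snapshot consumes no extra space until data in 'tank/home' changes, while the clone is immediately writable.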

ZFS and Linux

ZFS is the standard filesystem for the Solaris/OpenSolaris OS, and its source code is published under the CDDL (Common Development and Distribution License). However, from the beginning (and hopefully forever) the Linux kernel has remained licensed under the GPLv2, which prevents any other code from being linked with the GPL'd Linux kernel unless that code's licence is GPLv2-compatible. So the open-sourced ZFS code cannot be added/linked to the kernel code like any other filesystem, either as a part of the kernel or as a kernel module. As a workaround, some solutions pointed out by the open source community are:
  1. A ‘court ruling’ (either in the US or EU, where ZFS is mainly used) stating that GPL and CDDL are compatible.
  2. Either of the parties (Linux and Solaris) needs to change the licence of its code to a mutually compatible one.
  3. A GPL’d ZFS reimplementation from scratch, which should be free from all the 56 patents that Sun has taken on ZFS code.
  4. A method by which ZFS can be made usable from Linux without combining the two code bases inside the kernel, for instance by running the ZFS code as a separate userspace program; this is allowed.
The possibility of Options 1 and 2 is remote, compelling us to choose between Options 3 and 4. As a solution along the lines of Option 3, a project named BTRFS, led by Chris Mason at Oracle, is under development; it has been merged into an 'rc' pre-release of the current Linux kernel (2.6.30) and is under testing. This will definitely take a long time, as ZFS itself was under development for five years. Option 4, realised through a framework called FUSE, seems the most stable option as of now and is what I am going to discuss as we go on.

FUSE

Filesystem in Userspace, or FUSE, helps implement a fully functional filesystem in a userspace program rather than directly in the kernel. It is available on operating systems like Linux, FreeBSD, etc. Its components (as of version 2.7.4) consist of a FUSE kernel module, a FUSE library containing libfuse and libulockmgr, and a special device file named /dev/fuse, used for communication between the kernel module and the userspace library. For user convenience, a program named fusermount is provided along with the FUSE package as an easy user-mode tool for mounting and unmounting the user-defined filesystem through the FUSE module.
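To see these pieces on a typical Linux system, something along the following lines should work (the module name and device path are the usual ones, but may vary with your distribution):
modprobe fuse
ls -l /dev/fuse
fusermount -V
The first command loads the FUSE kernel module if it is not built in, the second shows the communication device, and the last prints the version of the user-mode mount helper.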

ZFS on FUSE

ZFS on FUSE is a project under development by Ricardo Manuel da Silva Correia, a computer engineering student, and was originally sponsored by Google as part of the Google Summer of Code 2006. On completion of this project, ZFS will have a port on the FUSE framework, which effectively means that operating systems like Linux can use ZFS.

How it works

The zfs-fuse daemon acts like a server, managing ZFS on the system through the FUSE framework. Every filesystem operation that an application performs on a mounted ZFS dataset goes through the standard C library system calls. These invoke the appropriate functions of the kernel's virtual filesystem (VFS) interface, which are hooked to the FUSE kernel module, registered like any other filesystem module. The FUSE module forwards each request through the special-purpose device /dev/fuse, which acts as a bridge between the kernel and the userspace ZFS implementation. On the other side, the FUSE library libfuse, whose callbacks mirror the VFS interface, delivers the request to the ZFS implementation, which in this case is the zfs-fuse daemon. The daemon returns the result of the filesystem request in the required format back through the FUSE framework to the application.
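On a system where zfs-fuse is already running with a mounted pool, the pieces of this chain can be glimpsed with a few commands (the exact output will, of course, vary from system to system):
pgrep -l zfs-fuse
lsmod | grep fuse
grep fuse /proc/mounts
The first shows the userspace daemon, the second the FUSE kernel module, and the third lists the mounted ZFS filesystems with a FUSE filesystem type.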

Getting started

ZFS on FUSE is available in two forms: as a release version packed as a bzip2 archive, or directly in source form from the Mercurial repository. Installing from source requires that we use scons instead of make, though the commands and options are almost the same for both. It's better to read the README and INSTALL files in the source directory before proceeding. Besides, for certain distributions like Gentoo, Debian, Fedora, Ubuntu, etc, zfs-fuse is available via the regular package management system, making installation much easier. Just use your package manager and search for “zfs”.
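For a source build, the sequence is roughly as follows; the tarball name and directory layout below are only an example, so check the README and INSTALL files of the release you actually download:
tar xjf zfs-fuse-0.5.tar.bz2
cd zfs-fuse-0.5/src
scons
scons install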

Installation on Fedora 10

As I was using Fedora 10 while testing ZFS, my commands and configuration files are more specific to Fedora, though with minor variations the same should apply to most distros.
First install the zfs-fuse package using the command [all commands from here on should be executed as the root user, unless otherwise mentioned]:
yum install zfs-fuse
This installed zfs-fuse version 0.5 on my Fedora 10 system.
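You can confirm the installed version with the package manager itself; the exact version string will depend on your distribution's repositories:
rpm -q zfs-fuse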

Setting up ZFS

Before executing any commands, verify that the zfs-fuse daemon is running:
pgrep zfs-fuse
If it's not, start the service:
service zfs-fuse start
…or directly run the script file as follows:
/etc/init.d/zfs-fuse start
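If you also want the daemon started automatically at boot, the usual SysV tools should work on Fedora 10, assuming the package installed the init script named zfs-fuse shown above:
chkconfig zfs-fuse on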

Managing ZFS

After making sure that the zfs-fuse daemon is running, we need a ZFS pool comprising one or more devices. We will create a pool, say 'K7', representing a group with many users, each having their own filesystem on 'K7'. A user, say 'ajc', will have his own filesystem, which will be mounted under 'K7' with the same user name, along with the required properties.
zpool create K7 sda10
This will create a pool named 'K7' using the /dev/sda10 device. You can also give the full path /dev/sda10 instead of just sda10; however, it's not required, since zfs-fuse searches for devices in that directory by default. If the -n option is specified after create, then no pool will be created; this is just a dry run, which shows the layout ZFS would have after the execution of that command. By issuing the above command, we not only created a pool but also implicitly created a dataset (more specifically, a filesystem), which will be mounted by default at '/K7'. It is important to avoid any pool name that clashes with a directory under / [the root directory]. However, if you want to explicitly specify the mount point, say at /mnt/k7 or elsewhere, then execute the following:
zpool create -m /mnt/k7 K7 sda10
…or, if the pool 'K7' already exists:
zfs set mountpoint=/mnt/k7 K7
However, after this, K7 won't be mounted anywhere. So we need to either mount all filesystems automatically by issuing the following command:
zfs mount -a
…or mount any specific filesystem, as in:
zfs mount K7
For unmounting, we use the unmount subcommand instead of mount in the above commands.
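Just as an illustration, the following unmounts the 'K7' filesystem, and running zfs mount with no arguments lists whatever is currently mounted:
zfs unmount K7
zfs mount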
Also, at any point in time, if you want to list all the pools in your system, execute the command given below:
zpool list
The health status of a pool can be checked with the following:
zpool status
This command can take the optional arguments -x and -v for a quick overview and a verbose status, respectively.
Since we have created a pool named 'K7' along with a filesystem of the same name and mounted it at /mnt/k7, to properly utilise the pool we may need more suitably named filesystems in 'K7'. This can be achieved by using the dataset-specific command zfs rather than the pool command zpool.
For example:
zfs create K7/ajc
…will create a filesystem mounted at a sub-directory 'ajc' of the directory where K7 is mounted, which in our case is /mnt/k7/ajc. Similar to the mounting options for pools mentioned above, filesystems also have options like:
zfs create -o mountpoint=/mnt/k7/ajc K7/ajc
Or, if you want to change the mount point of an already created filesystem, use:
zfs set mountpoint=/mnt/k7/ajc K7/ajc
It is quite possible that after some time the space you allocated for the pool may run out. Using the in-built compression can be a temporary, yet ready-made, solution for such a situation:
zfs set compression=on K7/ajc
Another way to tackle this is to add devices to the pool, whose space will be added to that already available:
zpool add K7 sda11
As a counterpart to add, we also have remove, which removes added devices from the pool, but with the restriction that removal can be performed only on hot spares (that is, inactive devices that become active when the pool is degraded).
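To make the hot-spare case concrete, a small sketch (the device name sda12 is just an example, and this assumes your zfs-fuse build supports hot spares) would be:
zpool add K7 spare sda12
zpool remove K7 sda12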
Like mountpoint and compression, many other properties of a filesystem, like 'quota', 'reservation', etc, can also be set:
zfs set quota=3G K7/ajc
zfs set reservation=1G K7/ajc
Properties of a filesystem can be viewed using get, as follows:
zfs get quota K7/ajc
And to see all properties, issue the following command:
zfs get all K7/ajc
As mentioned earlier, ZFS gives a lot of importance to data validation, which is also called scrubbing, and this can be performed on a pool using the scrub command:
zpool scrub K7
If at any point you want to see all the commands you have issued on pools, use:
zpool history
Or, for a particular pool like K7, issue the following:
zpool history K7
Likewise, use the iostat subcommand to get statistics on I/O operations on pools.
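Since iostat itself is not shown above, here is a small example; the -v switch breaks the statistics down per device, and the trailing number (purely illustrative) makes the report repeat every five seconds:
zpool iostat K7
zpool iostat -v K7 5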
Now, for creating a snapshot of any filesystem, we can issue:
zfs snapshot K7@snap1
The snapshot of a filesystem is represented by the filesystem name followed by '@' and then the snapshot name. Use the -r option to create snapshots recursively on all filesystems under the specified filesystem, as shown below:
zfs snapshot -r K7@snap2
Now, after a lot of changes to the filesystem, if you want to go back to a snapshot, issue the rollback command. The -r switch is required here, as we have to remove the newer snapshot 'snap2' to roll back to 'snap1':
zfs rollback -r K7@snap1
Or, if the snapshot you are rolling back to is the newest of all the snapshots of the filesystem, then use the following:
zfs rollback K7@snap2
As with the listing of pools, datasets (which include filesystems and snapshots) can be displayed using the command given below:
zfs list
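If the listing gets cluttered with snapshots, the -t option (assuming the version shipped with zfs-fuse supports it) restricts the output to a particular dataset type:
zfs list -t snapshot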
The snapshots created can be easily transferred between pools, or even between systems, using the send and recv commands. The following command will create a new filesystem 'K7_snap' under 'K7' from the snapshot 'snap1':
zfs send K7@snap1 | zfs recv K7/K7_snap
The following command is the same as the one above, but the new filesystem and snapshot will be created on a remote system, 'sreejith':
zfs send K7@snap1 | ssh root@sreejith zfs recv K7/K7_snap
As we know, ZFS is the native filesystem of Solaris, and if we want to migrate a pool from Solaris to some other OS like Linux, we first have to export the pool from Solaris (or whatever OS it was being used in) and then import it on the required OS:
zpool export K7
In order to forcefully export 'K7', we can use the -f switch with the above command.
The following command will display all importable pools with their names and IDs:
zpool import
…and we can import a pool using its name (or even its ID) by issuing the command below:
zpool import K7
And finally, the destroy command is used to destroy a pool or a filesystem. The following destroys the 'ajc' filesystem in 'K7':
zfs destroy K7/ajc
…while the next command destroys the 'K7' pool altogether:
zpool destroy K7
Though ZFS on FUSE manages to implement a lot of the features of native ZFS, it is still not complete, as has been pointed out in the project's status page. Since the implementation lives in userspace and has to talk to the Linux kernel through the FUSE module, its performance and scalability are not on par with in-kernel filesystem implementations as of version 0.5. Even then, the project is a nice way to get acquainted with the revolutionary ZFS on operating systems like Linux. It is expected, however, that a properly tuned ZFS on FUSE may offer performance comparable to native filesystems, as in the case of NTFS-3G, the freely available and commercially supported read/write NTFS driver for Linux, FreeBSD, Mac OS, etc.
 
