Thursday, July 12, 2012

FreeBSD NAS/SAN with ZFS (part 1)

Part 1: Initial Setup

I had to set up a NAS/SAN recently using a Dell PowerEdge 1900. The ultimate goal is to have CIFS (through Samba 3.x), NFS, and iSCSI support. With so few drives, performance will not be amazing and 8GB of RAM is a bit low for ZFS, but I needed cheap, fairly reliable storage. I can recover from a system crash fairly easily, and I can restore the data from backups, if necessary. I generally advise against using large drives for RAID devices because of the long rebuild times in the event of failure. However, this system is designed to provide cheap storage, so RAIDZ2 will have to do.

The pros of this system:

  • Cheap
  • Easy to set up
  • Fairly reliable (though there are single points of failure): RAIDZ2 will permit two drive failures before data loss
  • Reasonable performance for a fileserver
  • Flexible: the system has the entire ports tree
The cons:
  • At least two single points of failure: a single system drive (albeit an SSD) and a single power supply, though I will be rectifying the power supply issue shortly
  • FreeBSD claims the system drive does not support TRIM. If I had chosen a traditional platter-based drive, this wouldn't matter. As it is, I might see some performance degradation on the system drive after a certain amount of time. Since I won't be doing much on the system drive, this may not be much of an issue
  • Performance will be mediocre with so few spindles (I'd love to have 15 or more in the pool, but I don't have an external JBOD array available)
  • Rebuild time will be long, due to the 2TB drives
  • No dedicated hot spare (no free ports on the SAS controller)

It's important to note that Nexenta Community Edition is an option, as is FreeNAS (I've ruled out OpenFiler because I've had so many issues with it over the years). However, I had the following issues:

  1. Nexenta CE is not to be used for production systems
  2. FreeNAS is on an older revision of ZFS than FreeBSD proper

The system is configured like so (it's not the greatest in terms of high availability):

  • 1x Intel Xeon E5335 processor (quad core @2GHz)
  • 8GB RAM (FB-DIMMs)
  • 1 Dell SAS 5i card (LSI based)
  • 7x Hitachi 2TB 7,200RPM SATA drives
  • 1x OCZ Vertex 30GB SSD 
  • 2x Intel Gigabit server NICs

It's important to note that the chassis really only supports six 3.5" drives, so I had to use drive brackets to mount the remaining 2TB drive and the SSD. It would be best to use a pair of SSDs for the base OS, but I didn't have any free ports left on the SAS controller. I highly recommend using a pair of drives for redundancy.

Alternatively, I could have used the onboard SATA, but I'm actually okay with the system functioning like an appliance (i.e., I'll back up the base configuration (/etc, /usr/local/etc) and the ZFS pools). You can, of course, use a USB stick for the base OS if you'd like. Ultimately, I'd probably want to use SSDs for the caches, but I ran out of drive ports.

The steps were:

  1. Download FreeBSD 9.0 AMD64 and burn to a DVD
    1. DVD 1 ISO
    2. Or... use the memory stick version
    3. or... use the boot only version and do a net install
  2. Boot the machine from the DVD
  3. Install the system. I prefer to partition Unix systems when possible; I really dislike having an errant log file fill up a single filesystem. It's certainly convenient not to have to carve out space, but if I go that route, I prefer a properly managed filesystem like ZFS so I can create new volumes and set quotas on them.

I partitioned it like so:

Filesystem           Size    Used   Avail Capacity  Mounted on
/dev/da0a              2G    356M    1.5G    19%    /
devfs                1.0k    1.0k      0B   100%    /dev
/dev/da0d            503M    4.1M    459M     1%    /tmp
/dev/da0e            4.9G    303M    4.2G     7%    /var
/dev/da0f            7.9G    2.5G    4.7G    35%    /usr
/dev/da0g              2G     16M    1.8G     1%    /home

Your needs may vary. I enabled TRIM support on all the filesystems, but FreeBSD complains about that and claims that the drive does not support it. I'm pretty sure it does. I'll likely have to address this issue later.

After partitioning and installing the OS, I set up user accounts, used freebsd-update fetch and freebsd-update install to apply security patches, and finally used portsnap to create and update the ports tree.
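For reference, the update and ports-tree steps boil down to the following (run as root; this is the standard sequence, shown here for completeness):

```shell
# Fetch and apply binary security updates
freebsd-update fetch
freebsd-update install

# Create the ports tree the first time, then keep it current
portsnap fetch extract
portsnap fetch update
```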

3a. Set up the necessary components in /etc/rc.conf:
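I didn't keep the exact list, but a minimal sketch for this kind of build might look like the following (the specific entries are an assumption based on the goals above; the Samba/NFS/iSCSI services come later in the series):

```shell
# /etc/rc.conf additions -- minimal sketch, entries assumed
zfs_enable="YES"     # mount ZFS pools and datasets at boot
sshd_enable="YES"    # remote administration
```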


 4. I set up all the necessary networking components I wanted, such as ntpd and NIC teaming/bonding (failover only).
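As a preview of the failover setup (the next part covers it in detail), the rc.conf entries look roughly like this; the em0/em1 interface names and the address are assumptions for illustration:

```shell
# /etc/rc.conf -- NTP plus a failover lagg (interface names assumed)
ntpd_enable="YES"
cloned_interfaces="lagg0"
ifconfig_em0="up"
ifconfig_em1="up"
ifconfig_lagg0="laggproto failover laggport em0 laggport em1 inet 192.168.1.10/24"
```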

5. Zpool creation:

These 2TB drives have 4K sectors, but emulate 512bytes for maximum OS compatibility. I wanted to align the ZFS pool to match the 4K sectors for optimal performance. I found numerous discussions online, but this page was the most straightforward, at least for my purposes.

Alternatively, there is this howto:

Basically, you create GNOP providers (see the gnop man page) with 4096-byte sector sizes, create the pool on them, export the ZFS pool, destroy the gnop providers, and re-import the pool.

Here's the list of drives:

sudo camcontrol devlist
Password:
<...>  at scbus0 target 0 lun 0  (da0,pass0)
<...>  at scbus0 target 1 lun 0  (da1,pass1)
<...>  at scbus0 target 2 lun 0  (da2,pass2)
<...>  at scbus0 target 3 lun 0  (da3,pass3)
<...>  at scbus0 target 4 lun 0  (da4,pass4)
<...>  at scbus0 target 5 lun 0  (da5,pass5)
<...>  at scbus0 target 6 lun 0  (da6,pass6)
<...>  at scbus0 target 7 lun 0  (da7,pass7)
<...>  at scbus2 target 0 lun 0  (cd0,pass8)

 I created my pool like so:

# use a for loop, if you'd like
# If I were provisioning many more than 7 drives, I'd probably just do a for loop, too
sudo gnop create -S 4096 /dev/da1
sudo gnop create -S 4096 /dev/da2
sudo gnop create -S 4096 /dev/da3
sudo gnop create -S 4096 /dev/da4
sudo gnop create -S 4096 /dev/da5
sudo gnop create -S 4096 /dev/da6
sudo gnop create -S 4096 /dev/da7
sudo zpool create dpool1 raidz2 /dev/da1.nop /dev/da2.nop /dev/da3.nop \
  /dev/da4.nop /dev/da5.nop /dev/da6.nop /dev/da7.nop
sudo zpool export dpool1
sudo gnop destroy /dev/da1.nop /dev/da2.nop /dev/da3.nop \
  /dev/da4.nop /dev/da5.nop /dev/da6.nop /dev/da7.nop
sudo zpool import dpool1

sudo zdb -C dpool1 | grep ashift
This should return 12 (2^12 = 4096, matching the 4K sectors).

I have a working pool:


> sudo zpool status dpool1
  pool: dpool1
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sat Jun 30 11:32:09 2012
config:

        NAME        STATE     READ WRITE CKSUM
        dpool1      ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0

errors: No known data errors

I did some benchmarks, and performance was acceptable for my purposes. I'd really want to get the spindle count much higher if I were using this for something like a DB server or a large hypervisor.
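I didn't record the exact numbers, but a quick sequential-throughput sanity check can be done with dd; this is a rough sketch, not a rigorous benchmark, and the path on the pool is an assumption:

```shell
# Rough sequential write/read check (path on the pool is assumed)
TARGET=/dpool1/bench.tmp
dd if=/dev/zero of="$TARGET" bs=1M count=1024   # sequential write
dd if="$TARGET" of=/dev/null bs=1M              # sequential read
rm "$TARGET"
```

Note that with compression enabled, zeroes compress away, so a real benchmark should use tools like bonnie++ or iozone instead.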

The next section in this series will cover the teaming/bonding in FreeBSD.


C. Hayre said...

Great post! I am currently building a NAS myself, and several of your points echo some of my sentiments. One that I failed to consider ahead of time is the rebuild time on such a large drive (I went 3TB). You should be fairly insulated from data loss though, with RAIDZ2. After reading this I'm almost wondering if I should pick up an extra disk for the same reason, since my original plan was RAIDZ2.

On the OS decision, I'm still going back and forth between NAS4Free and FreeBSD. NAS4Free may help with some of my FreeBSD noobism, but I'm drawn to just doing it myself with FreeBSD (native). I've also been considering using ZFSguru as an aid/gui for FreeBSD, but am not sold on it. FreeNAS is out because of the 8.x thing, and because I'm fairly unimpressed with the project since rewrite.

Anonymous said...

Hi there!

Hopefully I can answer some questions for you.

Sandforce SSD
First, your SSD is a weak point in your setup. You've chosen a rather unreliable SSD (Sandforce). This is fine when you use the SSD for caching (L2ARC) but not as primary OS storage.

Worse, you won't even detect corruption, as you use the old-style UFS filesystem on your SSD. This is not a setup I would recommend.

Your SSD does not support TRIM, because it is connected to a SCSI/SAS controller and not an AHCI or ATA controller. TRIM is an ATA command, thus you need to connect it to your chipset ports in ATA ('IDE') or AHCI mode.

With the absence of TRIM you should partition your SSD to leave a large portion of space unused and unpartitioned. This is to prevent it being written to. All space you do not write to, will be used as spare space by the SSD. By partitioning only 70% of your SSD you are giving a guaranteed 30% back to your SSD. This is known as overprovisioning. If you are using L2ARC this is highly recommended. Note that you need a brand new SSD or a TRIM/Secure Erased SSD. Otherwise, data written to those sectors won't disappear.
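On a fresh (or secure-erased) drive, overprovisioning is just a matter of not allocating the full disk; a sketch with gpart, where the device name and sizes are assumptions for a 30GB SSD:

```shell
# Partition only ~70% of a 30GB SSD, leaving the rest unallocated
# as spare area (device da0 and the 21G size are assumed)
gpart create -s gpt da0
gpart add -t freebsd-ufs -s 21G da0
# the remaining ~9GB is never written, so the SSD can use it internally
```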

You can TRIM-erase your SSD with commands:

newfs -E -b 65536 /dev/adaX
Your device name has to be /dev/ad or /dev/ada. If your device is /dev/da, then you have a SCSI/SAS interface.

The solution
Go for an all-ZFS system with no legacy filesystems like UFS present. You can do this by installing FreeBSD directly on a ZFS pool, known as 'Root-on-ZFS'. You can use the 'mfsbsd' script, as well as distros such as ZFSguru or FreeNAS (not sure about that one).

With the OS on your pool, the OS also gains redundancy and protection against corruption. You can even enhance this protection by setting copies=2, assuming your boot filesystem is limited in size anyway.
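Setting this is a one-liner; the dataset name here is an assumption for illustration:

```shell
# Store two copies of every block on the boot dataset (name assumed)
zfs set copies=2 dpool1/root
zfs get copies dpool1/root   # verify the property took effect
```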

With your SSD free as being relieved of its 'system disk' duties, you can assign it as L2ARC caching for your main pool. The unreliable nature of Sandforce doesn't matter in this case. Corrupt data from the SSD will be detected and data will be read from HDD pool instead. Just make sure your RAM is corruption-free.

David Wankmüller said...

Hi, can you tell me which OS runs on the client? And did you discover any performance issues with how the data is cached? We use FreeBSD 9 (installed on a USB flash device) and ZFS v28 with NFSv4. For example, when I do `grep /mnt/file` it produces network traffic. When I do it again, it produces the same traffic, while with UFS or other filesystems the data is cached on the client. This reduces performance dramatically when working with data on a mounted filesystem. This behavior only occurs on Linux clients. When using FreeBSD as a client, everything works fine.

Anonymous said...


How did you enable TRIM on FreeBSD 9?

Rivald said...

Use the tunefs command like so:

1. unmount the filesystem in question

2. tunefs -t enable /dev/my_filesystem

You can also run tunefs -t enable against the filesystem's mount point.