I have a need for shared storage of around 300GB worth of 200×200 image files. These files are written once, then resized and stored. Once stored they never change again – they might get deleted.

They get served to 10 Squid machines and the cache times are huge – years. In other words this is a very low IO setup: very few writes, reasonably few reads, and the data isn't that big – just a lot of files, around 2 million.

In the past I used a DRBD + Linux-HA + NFS setup to host this, but I felt there was a bit too much magic involved, and it would also be nice to use both nodes at a time rather than running active-passive.

I considered many alternatives; in the end I settled on GlusterFS based on the following:

  • It stores plain files: each storage brick just holds a lot of files on ext3 or whatever, and you can still safely read those files directly on the bricks. In the event of a filesystem failure or other incident, your existing tool set for dealing with filesystems still applies.
  • It seems very simple – use a FUSE driver, store some xattr data with each file and let the client sort out replication (there's a rough volfile sketch after this list).
  • I had concerns about FUSE, but I felt that with my low IO needs it would not be a problem, as the Gluster authors are very insistent – almost insultingly so when asked about it on IRC – that FUSE issues are just FUD.
  • It has a lot of flexibility in how you lay out your data: you can build all of the basic RAID-style setups using reasonably priced machines as storage bricks.
  • There is no metadata server. Most cluster filesystems need a metadata server on dedicated hardware, kept resilient using DRBD and Linux-HA – exactly the setup I wish to avoid, and overkill when all I need is a 2 node cluster.
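
For reference, a 2 brick replicated client volfile from this era looks roughly like the sketch below; the host names, brick names and volume names are made up for illustration:

    volume remote1
      type protocol/client
      option transport-type tcp
      option remote-host storage1
      option remote-subvolume brick
    end-volume

    volume remote2
      type protocol/client
      option transport-type tcp
      option remote-host storage2
      option remote-subvolume brick
    end-volume

    volume replicate
      type cluster/replicate
      subvolumes remote1 remote2
    end-volume

The client mounts that volfile over FUSE – something like mount -t glusterfs /etc/glusterfs/client.vol /mnt/images – and handles the replication itself, with no metadata server anywhere.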

Going in I had a few concerns:

  • There is no way to know the state of your storage in a replicated setup. The clients take care of data syncing, not the servers, so there's no health indicator anywhere – you end up rolling your own checks, like the one sketched after this list.
  • To re-sync your data after a maintenance event you need to run ls -lR over the mount to read each file; this checks the validity of each file and syncs out any stale ones. This seemed very weird to me, and in the end my fears about it were well founded.
  • The documentation is poor – extremely poor and lacking. What there is applies to older versions, and the code has had a massive refactor in version 3.
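
Since the servers don't track replica state, the closest I could get to a health check was comparing the bricks' backend directories directly – for example an rsync dry run from one brick to the other. The host and path here are made up:

    # dry run (-n) itemising differences (-i) between the two bricks' backend directories
    rsync -ani /data/brick/ storage2:/data/brick/

It's crude – it only compares names, sizes and mtimes – but with no health indicator in the filesystem itself it's about all you have.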

I built a few test setups, first on EC2 then on some of my own VMs, tried to break them in various ways, tried to corrupt data and to come up with scenarios where the wrong file would be synced, and found it overall to be sound. I went through the docs, identified the documented shortfalls and checked whether these still existed in 3.0; mostly they didn't apply anymore.

We eventually ordered kit; I built the replicas using their suggested tool, set it up and copied all my data onto the system. Immediately I saw that small files were going to kill this setup: an rsync of 150GB took many days over a Gigabit network. IRC suggested that if I was worried about the initial build being slow, I could rsync the data onto each machine directly, then start the FS layer and sync it up with ls -lR.
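
The suggested procedure is roughly the sketch below; the hosts, paths and volfile location are made up for illustration:

    # seed each brick's backend directory directly, bypassing GlusterFS entirely
    rsync -a /srv/images/ storage1:/data/brick/
    rsync -a /srv/images/ storage2:/data/brick/

    # then start the brick processes and mount the volume on a client over FUSE
    mount -t glusterfs /etc/glusterfs/client.vol /mnt/images

    # and walk the whole tree so every file gets stat'ed and self-healed if needed
    ls -lR /mnt/images > /dev/null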

I tested this theory out and it worked: files copied onto my machines quickly, and judging by the write traffic to the disks and network, the ls -lR at the end found little to change – both bricks were in sync.

We cut 12 client nodes over to the storage and at first it was great. Load averages were higher, which I expected since IO would be a bit slower to respond, but nothing to worry about. A few hours in, all client IO just stopped: an ls, or a stat on a specific file, would take 2 or 3 minutes to respond. For a web app this is, predictably, completely unbearable.

A quick bit of investigation suggested that the client machines were all doing lots of data syncing – very odd, since all the data was in sync to start with, so what gives? It seemed that with 12 machines all resyncing data the storage bricks just couldn't cope; they were showing very high CPU. We shut down the 2nd brick in the replica and IO performance recovered, so we were able to run – but now without a 2nd host active.

I asked on the IRC channel for advice on debugging this and roughly got the following options:

  • Recompile the code with debugging enabled, shut everything down and deploy the new build; it would perform worse, but at least you can find out what's happening.
  • Make various changes to the cluster setup files – tweaking caches etc. These at least didn't require recompiles or total downtime, so I was able to test a few of them.
  • Get the storage back in sync by firewalling the bulk of my clients off the 2nd brick, leaving just one – say a dev machine – then start the 2nd brick, fix the replica with ls -lR and re-enable all the nodes (a sketch of this follows below). I was able to test this, but even with just one node doing file syncs, IO on all the connected clients failed – even though my bricks weren't overloaded IO- or CPU-wise.
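
The firewalling part of that last option is simple enough. On the 2nd brick it amounts to something like the lines below, where 10.0.0.50 stands in for the one dev client and 6996 is an assumption for the brick's listening port – check the listen-port option in your own server volfile:

    # allow only the dev client to reach the brick process, drop everyone else
    # (the address and port are assumptions, not from my actual setup)
    iptables -A INPUT -p tcp --dport 6996 -s 10.0.0.50 -j ACCEPT
    iptables -A INPUT -p tcp --dport 6996 -j DROP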

I posted to the mailing list hoping to hear from the authors, who don't seem to hang out on IRC much, and this was met with zero responses.

At this point I decided to ditch GlusterFS. I don't have a lot of data about what actually happened or what caused it; I can't say with certainty what was killing all the IO – and that really is part of the problem: it is too hard to debug issues in a GlusterFS cluster when you need to recompile and take it all down first.

Debugging complex systems is all about data: being able to get debug information when needed, to graph metrics, to instrument the problem software. With GlusterFS this is either not possible or too disruptive. Even if the issues could be overcome, getting to that point is simply too disruptive to operations because the software is not easily managed.

Had the problem been something else – not replication related – I might have been better off, since I could have enabled debug on one of the bricks. But at that point I had just one brick with valid data, and any attempt to sync the second node killed IO, so in order to run debug code I would have had to unmount all connected clients and rebuild and restart my only viable storage server.

The bottom line is that while GlusterFS seems simple and elegant, it is too hard – or impossible – to debug should you run into problems. An HA file system should not require a complete shutdown to try out suggested tweaks, recompiles and the like. Going down that route could mean days or even weeks of regular service interruption, and that is not acceptable in the modern web world. Technically it might be sound and elegant; from an operations point of view it is not suited.

One small side note: as GlusterFS stores a lot of its magic data in extended attributes on the files, I found that my GlusterFS based storage was about 15 to 20% bigger than my non-GlusterFS copies, which seems a huge amount of waste. Not a problem these days with cheap disks, but worth noting.
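
If you want to see what it's storing, you can dump the extended attributes directly on a brick's copy of a file and compare disk usage against a plain copy of the same tree. The paths below are made up, and the exact attribute names vary between GlusterFS versions:

    # dump every extended attribute GlusterFS has attached to its copy of a file
    getfattr -d -m . -e hex /data/brick/images/ab/cd/example.jpg

    # rough comparison of on-disk usage against a plain copy of the same tree
    du -sh /data/brick /srv/images-plain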