{"id":1789,"date":"2010-09-22T09:11:19","date_gmt":"2010-09-22T08:11:19","guid":{"rendered":"http:\/\/www.devco.net\/?p=1789"},"modified":"2010-09-22T09:11:19","modified_gmt":"2010-09-22T08:11:19","slug":"experience_with_glusterfs","status":"publish","type":"post","link":"https:\/\/www.devco.net\/archives\/2010\/09\/22\/experience_with_glusterfs.php","title":{"rendered":"Experience with GlusterFS"},"content":{"rendered":"
I have a need for shared storage for around 300GB of 200×200 image files. These files are written once, resized and stored; once stored they never change again, though they might get deleted. <\/p>\n
They get served up to 10 Squid machines with huge cache times – years, in some cases. In other words this is a very low IO setup: very few writes, reasonably few reads, and the data isn’t that big – just a lot of files, around 2 million.<\/p>\n
In the past I used a DRBD + Linux-HA + NFS setup to host this, but I felt there was a bit too much magic involved, and it would be nice to be able to use both nodes at a time rather than run active-passive.<\/p>\n
I considered many alternatives; in the end I settled on GlusterFS based on the following:<\/p>\n
Going in, I had a few concerns:<\/p>\n
I built a few test setups, first on EC2 and then on some of my own VMs, tried to break it in various ways, tried to corrupt data and to come up with a scenario where the wrong file would be synced, and found it overall to be sound. I went through the docs, identified any documented shortfalls and checked whether these still existed in 3.0; mostly they no longer applied.<\/p>\n
We eventually ordered kit, built the replicas using their suggested tool, set it all up and copied my data onto the system. Immediately I saw that the small files were going to kill this setup: an rsync of 150GB took <b>many days<\/b> over a Gigabit network. On IRC it was suggested that, if I was worried about the initial build being slow, I could use rsync to prep each machine directly, then start the FS layer and sync it with <em>ls -lR<\/em>.<\/p>\n
I tested this theory out and it worked: files copied onto my machines quickly, and judging by the write traffic to the disks and network, the <em>ls -lR<\/em> at the end found little to change – both bricks were in sync.<\/p>\n
We cut 12 client nodes over to the storage and at first it was great. Load averages were higher, which I expected since IO would be a bit slower to respond, but nothing to worry about. A few hours in, all client IO just stopped: an ls, or a stat of a specific file, would take 2 or 3 minutes to respond. Predictably, for a web app this is completely unbearable.<\/p>\n
A quick bit of investigation suggested that the client machines were all doing lots of data syncing – very odd, since all the data was in sync to start with, so what gives? It seemed that with 12 machines all resyncing data the storage bricks just couldn’t cope; they were showing very high CPU. We shut down the 2nd brick in the replica, IO performance recovered and we were able to run – but now without a 2nd active host.<\/p>\n
I asked on the IRC channel for advice on debugging this and roughly got the following options:<\/p>\n
I posted to the mailing list hoping to hear from the authors, who don’t seem to hang out on IRC much, and this was met with zero responses.<\/p>\n
At this point I decided to ditch GlusterFS. 
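As an aside, the pre-seeding trick suggested on IRC boils down to the sketch below. The hostnames and brick paths are hypothetical, not my actual setup; the final walk simply stats every file on the mounted volume – which is all <em>ls -lR<\/em> effectively does – nudging GlusterFS to self-heal anything out of sync.

```shell
# Hypothetical sketch of the IRC-suggested pre-seed; brick1/brick2 and the
# paths are illustrative only.
#
# 1. With GlusterFS stopped, copy the data onto each brick's backing
#    directory directly, bypassing the (slow) FUSE layer:
#      rsync -a /source/images/ brick1:/data/brick/
#      rsync -a /source/images/ brick2:/data/brick/
#
# 2. Start the brick daemons, mount the volume on a client, then walk it
#    so every file gets stat()ed -- the equivalent of `ls -lR`:
MOUNT=${MOUNT:-.}   # point this at the client-side mount, e.g. /mnt/gluster
find "$MOUNT" -type f -exec stat {} + > /dev/null && echo "walk complete"
```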
I don’t have a lot of data about what actually happened or what caused it; I can’t say with certainty what was killing all the IO – and that really is part of the problem: it is too hard to debug issues in a GlusterFS cluster, because you need to recompile and take it all down to do so. <\/p>\n
Debugging complex systems is all about data: being able to get debug information when needed, to graph metrics, to instrument the problem software. With GlusterFS this is either not possible or too disruptive. Even if the issues could be overcome, getting to that point is simply too disruptive to operations because the software is not easily managed.<\/p>\n
Had the problem been something else – not replication related – I might have been better off, as I could have enabled debug on one of the bricks. But since at that point only one brick had valid data, and any attempt to sync the second node killed IO, running debug code meant unmounting all connected clients and rebuilding\/restarting my only viable storage server.<\/p>\n
The bottom line is that while GlusterFS seems simple and elegant, it is too hard – or impossible – to debug should you run into problems. An HA file system should not require a complete shutdown to try out suggested tweaks, recompiles and the like; going down that route could mean days or even weeks of regular service interruption, and that is not acceptable in the modern web world. Technically it may be sound and elegant; from an operations point of view it is not suitable.<\/p>\n
One small side note: as GlusterFS stores a lot of its magic data in extended attributes of the files, I found that my GlusterFS based storage was about 15 to 20% bigger than my non-GlusterFS copies, which seems a huge amount of waste. 
Not a problem these days with cheap disks but worth noting.<\/p>\n","protected":false},"excerpt":{"rendered":" I have a need for shared storage of around 300GB worth of 200×200 image files. These files are written once, then resized and stored. Once stored they never change again – they might get deleted. They get served up to 10 Squid machines and the cache times are huge, like years. This is a very […]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"","_et_pb_old_content":"","footnotes":""},"categories":[1],"tags":[90,33],"_links":{"self":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts\/1789"}],"collection":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/comments?post=1789"}],"version-history":[{"count":8,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts\/1789\/revisions"}],"predecessor-version":[{"id":1797,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts\/1789\/revisions\/1797"}],"wp:attachment":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/media?parent=1789"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/categories?post=1789"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/tags?post=1789"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}\n