Redis or: How I learned to stop worrying and love the unstable version

Numbers on Twitter are huge. Even with just a 1% sample of tweets, and even working only on tweets with a hashtag, you reach big numbers very fast.

If you want to offer a highly interactive and very responsive experience, you’ll need to keep most of those numbers in memory, and well indexed. So you’re going to need a lot of RAM: I was aware of that when I decided to use Redis as my main DB, and I purposefully chose a parsimonious data structure and a “cloud” – whatever that means – solution that would allow me to increase the amount of RAM when needed without breaking my shoestring budget.

What I didn’t know was that, with just over a gigabyte and a half of data, as measured by Redis itself, I was going to need double that amount of memory to keep my app from crashing (which happened twice yesterday). Multiply that by two servers for a minimum of redundancy, and you need 6 GB of precious RAM for just 1.3 million tags’ worth of data…

In the past I had found that, after a restart, Redis would use much less memory than before. So, after the obvious quick fix to get the service back up and running – increasing the amount of RAM – I tried just that, using the load balancer to keep the service up while restarting the two instances. But the gains from a restart were very short-lived, and I didn’t want to pay for that much memory, or to have to worry that even that wouldn’t be enough during my upcoming two-week trip to Russia.

Checking the INFO command output after the restarts, I noticed that the value increasing very fast was mem_fragmentation_ratio, which started around 1.3 but quickly reached and surpassed 2. After a quick search, it turned out that there is a known memory fragmentation problem with Redis on Linux, caused by malloc.
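For the record, that ratio comes straight from the INFO output, which is just colon-separated `key:value` lines. Here’s a minimal sketch of how one might parse it and flag trouble – the sample values below are made up, and the parser is my own throwaway, not an official client:

```python
# Sketch: parse Redis INFO output and flag excessive fragmentation.
# mem_fragmentation_ratio is reported by Redis itself: RSS as seen by
# the OS divided by the memory Redis believes it is using.

def parse_info(raw: str) -> dict:
    """Turn raw INFO text into a dict, skipping comments and blanks."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info

def fragmentation_ratio(info: dict) -> float:
    # Prefer the ratio Redis computes; fall back to computing it ourselves.
    if "mem_fragmentation_ratio" in info:
        return float(info["mem_fragmentation_ratio"])
    return int(info["used_memory_rss"]) / int(info["used_memory"])

# Made-up sample resembling what I was seeing before the upgrade:
SAMPLE = """\
used_memory:1610612736
used_memory_rss:3435973836
mem_fragmentation_ratio:2.13
"""

ratio = fragmentation_ratio(parse_info(SAMPLE))
print(f"fragmentation ratio: {ratio}")
if ratio > 1.4:  # the threshold I kept seeing described as abnormal
    print("warning: memory fragmentation looks bad")
```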

I had already noticed that disquieting field in the past, but hadn’t realized just how bad the situation was. A ratio of more than two to one seemed far too high, especially since a quick search on the internet revealed that a ratio of 1.4 was already considered abnormal.

Delving deeper into the issue, I found out that the fragmentation problem is especially bad with sorted sets, and hashtagify uses a lot of sorted sets: as a matter of fact, of the 4 million keys in use right now, more than 90% are sorted sets!
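Getting that 90% figure doesn’t require anything fancy: sampling key types and tallying them is enough. A sketch of the tallying part – the redis-py calls in the comment show how I’d feed it from a live server, but they are illustrative and untested here; the helper itself is just counting:

```python
from collections import Counter

def type_distribution(types):
    """Given an iterable of Redis type names ('zset', 'hash', ...),
    return each type's share of the keyspace as a fraction."""
    counts = Counter(types)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

# Against a live server it could be fed like this (redis-py, untested
# sketch -- on large keyspaces you'd sample rather than walk every key):
#   r = redis.Redis()
#   dist = type_distribution(r.type(k) for k in r.scan_iter())

# Made-up sample standing in for a walk of my keyspace:
sample = ["zset"] * 92 + ["hash"] * 5 + ["string"] * 3
dist = type_distribution(sample)
print(dist["zset"])  # 0.92 -- sorted sets dominate
```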

Different Redis builds have been made to address this issue, but I didn’t find a “definitive” post/article/howto about which one is best. On the other hand, I noticed that version 2.4, which is not final yet, incorporates one of those solutions – the use of jemalloc – so I tried it on one of my servers.
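For anyone wanting to try the same thing, the gist is just building the 2.4 branch, which bundles jemalloc and links against it by default on Linux. A sketch of the steps – the tarball name and URL below are illustrative, check the Redis download page for the current ones:

```shell
# Illustrative only: fetch and build a 2.4 pre-release (exact file
# name and URL will differ; see the Redis site for current downloads).
wget http://redis.googlecode.com/files/redis-2.4.0-rc1.tar.gz
tar xzf redis-2.4.0-rc1.tar.gz
cd redis-2.4.0-rc1
make                       # on Linux this uses the bundled jemalloc
./src/redis-server --version
```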

Things immediately improved; it’s too early to be sure, but after many hours the fragmentation ratio is still only 1.02. Halving my memory requirements with a simple version upgrade is not bad at all! Now I should be able to leave for Russia without fearing that I’ll find myself bankrupt, or with a frozen server, when I return 🙂

I just wish it had been easier to learn about this beforehand, but such is the nature of open source software. Now I just hope this post will help at least someone else avoid my mistake of using the “official” 2.2 release.


UPDATE 2011-06-25

After 6 days, 2 hours and 4 minutes of uptime, during which 33,245,499 commands were executed, the fragmentation ratio is still 1.02. I can now confirm that jemalloc rocks 🙂

Greetings from Russia