Redis Persistence Latency
Redis has become an increasingly important part of our production stack at Leapfrog. We use it in two critical functions: as a session store and as the transport medium for our microservice implementation (PSH). The throughput of Redis was a big draw when looking at technologies to use for PSH, and the list data types are also very useful for feeding tasks to workers. Over time, however, we started to identify some performance issues with task throughput. Overall the performance was great, but we would occasionally see spikes of 20+ seconds when pushing a new task (a small JSON blob with parameters) onto the end of a list. Thus began the great redis investigation of 2015.
Under The Microscope
The entire ordeal began as we prepared to migrate datacenters. Every component of the infrastructure was under the microscope to make sure that the new data center was running properly. As our primary platform (FOX) was being tested, we started taking note of these latency spikes. We noticed the latency, recognized that we were seeing a similar pattern in our current environment and decided that it should not block the migration.
With the migration complete and some free cycles, we started taking a look at the problem. The most visible symptom was the latency spikes in task time on our Grafana dashboards, but we were also seeing random redis timeouts when publishing events. The first order of business was to see if we could reproduce the problem in QA. No one likes performing surgery on a live patient.
Luckily we had load testing scripts already written to send traffic at QA and we were able to quickly reproduce the latency patterns. The investigations could now begin in earnest. We are by no means redis experts so the tried and true practice of trial and error went into full force.
The first thing we wanted to do was make use of the latency tools that redis comes with but we were using an old version. So we upgraded to redis 3 and tried to see what the latency tools told us. Nothing. For some reason when we were seeing spikes in task latency we were not seeing any latency events logged by redis.
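For readers who want to try the same tools, here is a minimal sketch of switching on the latency monitor and reading back what it recorded. It assumes a redis-py style client that exposes `execute_command`; the helper names and the 100 ms default are hypothetical. The monitor only records events slower than `latency-monitor-threshold`, and that setting defaults to 0 (disabled), so it is worth checking first. The post does not say whether that was the reason nothing showed up for us.

```python
from typing import Any, Dict, List


def enable_latency_monitor(client: Any, threshold_ms: int = 100) -> None:
    """Ask redis to record events slower than threshold_ms (0 disables it)."""
    client.execute_command(
        "CONFIG", "SET", "latency-monitor-threshold", str(threshold_ms)
    )


def latest_latency_events(client: Any) -> List[Dict[str, Any]]:
    """Return the rows of LATENCY LATEST as dictionaries.

    Each row starts with: event name, unix timestamp of the latest spike,
    latest latency in ms, and the maximum latency in ms seen for that event.
    """
    events = []
    for row in client.execute_command("LATENCY", "LATEST"):
        name, timestamp, latest_ms, max_ms = row[:4]
        if isinstance(name, bytes):
            name = name.decode()
        events.append(
            {
                "event": name,
                "timestamp": timestamp,
                "latest_ms": latest_ms,
                "max_ms": max_ms,
            }
        )
    return events
```

With a real connection this would be called as `enable_latency_monitor(client)` before the load test and `latest_latency_events(client)` after it. An empty list means redis itself did not record anything over the threshold.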
Well that was helpful. What to try next…
We observed that the latency became worse/more frequent as the memory usage started to climb. What if the RDB persistence is causing the problem? Well that is an easy thing to check, let’s just turn it off!
AH HA! Spikes went away.
With a possible cause in sight we did some reading to better educate ourselves about how redis actually does its backups. For RDB persistence, at configured intervals, redis forks a child process that shares the parent's memory through copy-on-write, and the child writes a backup file to disk. If that file is written successfully, it replaces the previous backup. This is a fairly intensive process, especially as redis starts using more memory.
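The fork-then-replace pattern is easier to see in a toy version. The sketch below is not redis code: it is a Unix-only Python illustration with hypothetical function names, using JSON in place of the RDB format. The child process gets a point-in-time view of the data, writes it to a temporary file and then renames that file over the old backup, while the parent stays free to keep changing its own copy.

```python
import json
import os
import tempfile


def start_snapshot(data: dict, path: str) -> int:
    """Fork a child that writes `data` to `path`; return the child's pid."""
    directory = os.path.dirname(os.path.abspath(path))
    pid = os.fork()
    if pid == 0:
        # Child: sees the data exactly as it was at the moment of the fork.
        status = 1
        try:
            fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="temp-snapshot-")
            with os.fdopen(fd, "w") as handle:
                json.dump(data, handle)
                handle.flush()
                os.fsync(handle.fileno())
            # Replace the previous backup only after a complete write.
            os.replace(tmp_path, path)
            status = 0
        finally:
            # Exit without running the parent's cleanup handlers.
            os._exit(status)
    return pid


def finish_snapshot(pid: int) -> None:
    """Wait for the snapshot child and raise if it did not exit cleanly."""
    _, wait_status = os.waitpid(pid, 0)
    if wait_status != 0:
        raise RuntimeError("snapshot child failed")
```

If the parent appends to `data` between `start_snapshot` and `finish_snapshot`, the file on disk still holds the state from before the fork. The price is the fork itself plus a copy of every memory page the parent modifies while the child is running, which is why the cost grows with memory usage and write traffic.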
We identified a possible cause and decided to address the problem by reducing the frequency of persistence from every sixty seconds to once an hour. This has been successful in reducing the impact of the spikes but may not be the entire story.
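As a rough illustration of that change, the snapshot schedule lives in the `save` setting, written as pairs of "seconds changes". The sketch below assumes a redis-py style client with `config_set`. The helper names are hypothetical, and so is the change threshold of 1: we only said how often we persist, not how many writes trigger it.

```python
from typing import Any, Iterable, Tuple


def save_directive(rules: Iterable[Tuple[int, int]]) -> str:
    """Format (seconds, changes) pairs the way the `save` setting expects."""
    return " ".join(f"{seconds} {changes}" for seconds, changes in rules)


def set_snapshot_schedule(client: Any, rules: Iterable[Tuple[int, int]]) -> str:
    """Apply a snapshot schedule at runtime and return the value that was set.

    An empty list of rules produces an empty string, which turns RDB
    snapshots off.
    """
    value = save_directive(rules)
    client.config_set("save", value)
    return value


# Assumed thresholds, for illustration only.
EVERY_MINUTE = [(60, 1)]   # snapshot if at least 1 write in 60 seconds
EVERY_HOUR = [(3600, 1)]   # snapshot if at least 1 write in 3600 seconds
```

Calling `set_snapshot_schedule(client, EVERY_HOUR)` changes the running server only. The same value has to go into redis.conf (or be written out with CONFIG REWRITE) to survive a restart.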
As we were trying to reproduce the problem we set up a number of different benchmarks that, to varying degrees, mimicked what PSH was doing. The only version of the benchmark that reflected the latency we were seeing was the version that used blpop. All of our service workers use blpop to grab tasks off their respective list/queue in redis. Might there be some strange interaction between persistence, blpop and the inability to rpush items onto those queues?
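For anyone wanting to build a similar benchmark, here is a minimal sketch of the two halves: a producer that times each rpush and a worker that blocks on blpop. It is not our actual benchmark or PSH code. It assumes a redis-py style client with `rpush` and `blpop`, and the function names and the one second threshold are hypothetical.

```python
import json
import time
from typing import Any, Callable, List


def timed_rpush(client: Any, queue: str, task: dict) -> float:
    """Push one JSON task onto the end of the list; return seconds taken."""
    payload = json.dumps(task)
    start = time.perf_counter()
    client.rpush(queue, payload)
    return time.perf_counter() - start


def work_once(
    client: Any, queue: str, handler: Callable[[dict], None], timeout: int = 1
) -> bool:
    """Block on the queue for up to `timeout` seconds and handle one task.

    Returns False if the wait timed out with nothing to do.
    """
    item = client.blpop(queue, timeout=timeout)
    if item is None:
        return False
    _key, payload = item
    handler(json.loads(payload))
    return True


def slow_pushes(latencies: List[float], threshold_s: float = 1.0) -> List[float]:
    """Keep only the push timings at or above the threshold."""
    return [latency for latency in latencies if latency >= threshold_s]
```

Running `work_once` in a loop in several worker processes while another process collects `timed_rpush` results, then passing those to `slow_pushes`, gives a crude picture of whether pushes stall while workers are blocked and a snapshot is in progress.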
For now we are satisfied with the current fix, but I get the feeling we may not have uncovered the full problem. With the impact of the problem mitigated we have to move on to other priorities, but if we have a recurrence we will most likely start by investigating blpop.
Feel free to get in touch if you have had a similar experience and know more.
Selecting an appropriate data store for your intended use is important. We currently use redis for more than just transient task transportation and this investigation has called those decisions into question. For optimal performance we considered disabling persistence on redis entirely and only using it for applications where the data is so transient that restoring from scratch in case of an incident is perfectly acceptable. That forces us to update implementations that expect redis to have persistent data and use data stores that are more suited to that purpose.
Before adding a new data store to your stack carefully consider the nature of the store to ensure that it matches with your intended use. Using redis when you really need a stable relational store like PostgreSQL is a bad decision and you should feel bad for making it (we do).
Director, Software Engineering