SESSION ON-DEMAND

All Things P99

The event for developers who care about P99 percentiles and high-performance, low-latency applications

Cache Me If You Can: How Grafana Labs Scaled Up Their Memcached 42x & Cut Costs Too

Our cloud database stores billions of files in object storage. With petabytes of data being queried every day, we started bumping into our cloud storage providers’ rate limits, resulting in decreased reliability & performance. We had large memcached clusters in place to absorb & deamplify reads to object storage, but these could hold at most a few hours’ worth of data, and constantly churned due to the excessive volume of data passing through. The conclusion we came to was: we needed much larger caches, ideally without inflating our cloud costs or adding operational complexity.

I’ll show how we managed to increase our cache size by 42x and reduce our costs by using a little-known feature of memcached called “extstore”. Extstore enables offloading objects that can’t fit into memory to SSDs. In this talk I’ll cover why we chose it, how we use it, how to monitor it, and other considerations. I’ll also cover how we use ephemeral storage provided by public cloud vendors in the form of physically-attached SSDs with incredibly high throughput, low latency, and, best of all, low cost!

This talk is also a story of how products evolve, and how we as a team are buying time in the short term to maintain our reliability while we evolve our storage design over the medium to long term.

21 minutes
Fill out the form to watch this session from the P99 CONF 2024 livestream. You’ll also get access to all available recordings.

Danny Kopping, Senior Software Engineer at Grafana Labs

Danny is an engineer at Grafana Labs, based in South Africa. He works on both the Loki open-source product and the Grafana Cloud Logs hosted service. His interests include Go, Linux, playing drums, and the outdoors.