Our cloud database stores billions of files in object storage. With petabytes of data being queried every day, we started bumping into our cloud storage providers' rate limits, resulting in decreased reliability and performance. We had large memcached clusters in place to absorb and deamplify reads to object storage, but these could hold at most a few hours' worth of data and churned constantly under the sheer volume of data passing through them. We concluded that we needed much larger caches, ideally without inflating our cloud costs or adding operational complexity.
I'll show how we increased our cache size by 45x and reduced our costs using a little-known feature of memcached called "extstore". Extstore lets memcached offload objects that can't fit in memory to SSDs. In this talk I'll cover why we chose it, how we use it, how we monitor it, and other considerations. I'll also cover how we use the ephemeral storage offered by public cloud vendors in the form of physically attached SSDs, with incredibly high throughput, low latency, and best of all, low cost!
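As a taste of what the talk covers, here is a minimal sketch of what enabling extstore looks like. The paths and sizes are illustrative, not our production configuration: memcached is pointed at a file on the local SSD via the `ext_path` option, and extstore exposes its own counters through the `stats extstore` command.

```
# Hypothetical example: ~4 GB of RAM fronting ~500 GB of local SSD.
# Path and sizes are illustrative only.
memcached -m 4096 -o ext_path=/mnt/nvme/extstore:500G

# Inspect extstore's own counters for monitoring:
echo "stats extstore" | nc localhost 11211
```

Because keys and item metadata stay in RAM and only values spill to disk, a comparatively small amount of memory can front a far larger SSD-backed cache.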
This talk is also a story of how products evolve, and how we as a team are buying time in the short term to maintain reliability while we evolve our storage design over the medium to long term.