Monitoring the World: Scaling Thanos in Dynamic Prometheus Environments - Colin Douch, Cloudflare
Cloudflare's Thanos journey started back in 2017, with conversations about how we could have a single pane of glass to monitor our new Prometheus infrastructure, replacing our old centralised OpenTSDB instance. Since then, our Prometheus footprint has grown to monitor nearly 500 datacenters around the world, with Thanos continuing to provide that invaluable single pane of glass. Along the way, we've encountered and solved interesting scaling problems arising from running hundreds of geographically dispersed sidecars, collecting tens of billions of active timeseries. In this talk, we will explain these challenges and present the tooling we have developed to automatically manage and scale our infrastructure: creating and wiring up new buckets and sidecars as we provision new Prometheus servers around the world, automatically sharding stores as our buckets grow, and utilising spare CPU capacity to run compactors in each location during its off-peak hours.