Matt Bostock’s SREcon 2017 Europe talk covers how Prometheus, a metrics-based monitoring tool, is used to monitor the globally distributed infrastructure and network of CloudFlare, a CDN, DNS and DDoS mitigation provider.
Prometheus, an open source metrics-based monitoring project, has been around since 2012 and is a Cloud Native Computing Foundation (CNCF) project. It supports dynamic configuration, and its query language, PromQL, lets users write complex queries for alerts. Because CloudFlare provides a content delivery network (CDN), distributed DNS and DDoS mitigation services, its infrastructure is spread across the globe. Monitoring such an infrastructure and its network is complex, and the talk describes the role Prometheus plays in this. At CloudFlare, Prometheus has replaced 87% of what Nagios used to do.
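To make the point about PromQL in alerts concrete, here is a minimal Python sketch that assembles an alert-style PromQL expression and the URL for running it against Prometheus's instant-query HTTP endpoint (`/api/v1/query`). The metric name `http_requests_total`, the `pop` label, the threshold and the host name are assumptions chosen for illustration; the talk does not specify CloudFlare's actual metrics or rules.

```python
from urllib.parse import urlencode

# Hypothetical metric and label names, chosen only for illustration.
# Ratio of 5xx responses to all responses over 5 minutes, per PoP.
ERROR_RATIO_BY_POP = (
    'sum by (pop) (rate(http_requests_total{status=~"5.."}[5m]))'
    " / "
    "sum by (pop) (rate(http_requests_total[5m]))"
)


def alert_expression(threshold: float) -> str:
    """Return a PromQL expression that yields only PoPs above the threshold."""
    return f"({ERROR_RATIO_BY_POP}) > {threshold}"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for Prometheus's instant-query HTTP endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    expr = alert_expression(0.05)  # assumed threshold: 5% errors
    print(expr)
    # Hypothetical server address.
    print(instant_query_url("http://prometheus.example:9090", expr))
```

The same expression string could serve as the condition of an alerting rule. Because the comparison filters the result, only the PoPs that exceed the threshold would produce alerts.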
CloudFlare delivers its CDN-like services using Anycast. Anycast DNS allows DNS queries to be answered by the server nearest to the user, and Anycast HTTP likewise allows content to be served from the server nearest to the user. Acting as an intermediary between the original website and the user, CloudFlare also checks whether the visitor’s traffic shows threat patterns. It has 116 datacenters across 150 countries, and handles 5 million HTTP requests and 1.2 million DNS requests per second, which add up to 10% of global internet requests. Each point-of-presence (PoP) provides HTTP, DNS, attack mitigation and a key-value store. To monitor this, 188 Prometheus servers were in production at the time of the talk.