Avoiding Prometheus #heavy metrics attack!

Avoiding the high-cardinality Prometheus label attack

If you run a monitoring system, or would like to learn more about monitoring systems, this article may be useful for you. I want to talk about an issue with metric labels that can affect any monitoring system.

We use Grafana and Prometheus, but this issue is not specific to them: it can happen in similar setups with other tools.

Intro

If we imagine a monitoring system based on Prometheus, presumably we want to monitor our requests, errors, memory/CPU/… usage, and so on. Here we can use any of the metric types we need, based on the Prometheus documentation. As a starting point, let's take a look at a very basic usage example (here I use Go):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    cpuTemp = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "cpu_temperature_celsius",
        Help: "Current temperature of the CPU.",
    })
    hdFailures = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "hd_errors_total",
            Help: "Number of hard-disk errors.",
        },
        []string{"device"},
    )
)

func init() {
    prometheus.MustRegister(cpuTemp)
    prometheus.MustRegister(hdFailures)
}

func main() {
    // ### Pay attention to this part: the "device" label.
    cpuTemp.Set(33)
    hdFailures.
        With(prometheus.Labels{"device": "/dev/sda"}).
        Inc()

    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}

On each request we increase the error count for a device label and set our CPU temperature. After incrementing the counter for a few different devices, /metrics will look something like this:

# HELP cpu_temperature_celsius Current temperature of the CPU.
# TYPE cpu_temperature_celsius gauge
cpu_temperature_celsius 33
# HELP hd_errors_total Number of hard-disk errors.
# TYPE hd_errors_total counter
hd_errors_total{device="hard1"} 4
hd_errors_total{device="hard2"} 0
hd_errors_total{device="hard3"} 3
hd_errors_total{device="hard4"} 12

Everything seems good!

Describing the issue

Let's write a slightly more complicated case for our metrics. Here we need more detail, so we need more labels:

import "github.com/prometheus/client_golang/prometheus/promauto"
import "github.com/prometheus/client_golang/prometheus"
var httpRequests *prometheus.CounterVecfunc init() {
httpRequests = promauto.NewCounterVec(prometheus.CounterOpts{
Name: "http_requests_total",
Help: "HTTP Request count",
}, []string{"status", "method", "handler", "label1", "label2"})
}
func main() {
httpRequests.
WithLabelValues("status", "GET", "/some/route", "XXX", "YYY").
Inc()
http.Handle("/metrics", promhttp.Handler())
}

Here we have more labels on our metric counter. And guess what?

ISSUE: If a label's value is filled with changeable input from your untrusted clients, you can end up receiving arbitrary, invalid values for your labels!
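To make this concrete, here is a hypothetical handler (reusing the httpRequests counter defined above; the handler itself is a sketch, not code from a real project) that copies client-controlled values straight into label values:

// Hypothetical handler: it copies client-controlled values (the raw path
// and query string) straight into label values, so every new path or
// query a client invents creates a brand-new time series.
func handleAnything(w http.ResponseWriter, r *http.Request) {
    httpRequests.
        WithLabelValues("200", r.Method, r.URL.Path, r.URL.RawQuery, "YYY").
        Inc()
    w.WriteHeader(http.StatusOK)
}

Nothing here looks wrong at first glance, which is exactly why the problem tends to go unnoticed.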

For example, imagine we want to meter the request count on all of our routes, and over time we receive 10 million invalid requests with bogus paths (each producing a different label value). There may be no visible symptoms, because it happens over a long period and looks like a normal error rate in the system. Then, for every Prometheus request to the /metrics route, your app could print more than 10 million lines, and Prometheus would fetch all of that data on every scrape. You end up with a lot of invalid data and unwanted load on your Prometheus server.

# HELP http_requests_total HTTP Request count
# TYPE http_requests_total counter
http_requests_total{handler="/some/route",label1="VALUE1-00",label2="VALUE2-01",method="GET",status="200"} 10
http_requests_total{handler="/some/route",label1="VALUE1-00",label2="VALUE2-10",method="GET",status="200"} 22
http_requests_total{handler="/some/route",label1="VALUE1-00",label2="VALUE2-11",method="GET",status="200"} 3
http_requests_total{handler="/some/route",label1="VALUE1-01",label2="VALUE2-00",method="GET",status="200"} 45
http_requests_total{handler="/some/route",label1="VALUE1-01",label2="VALUE2-01",method="GET",status="200"} 53
http_requests_total{handler="/some/route",label1="VALUE1-01",label2="VALUE2-10",method="GET",status="200"} 69
http_requests_total{handler="/some/route",label1="VALUE1-01",label2="VALUE2-11",method="GET",status="200"} 47

And it can grow indefinitely!

In this situation, the response time of /metrics increases dramatically and your Prometheus queries slow down. And if your monitoring servers do not have enough resources, you may lose your monitoring entirely!

Simply like a Silent Death

Common Mistakes

This issue can happen in many situations where your labels can take different values, but here I just want to note some common mistakes made at development time (a rough normalization sketch follows the list):

  1. When you want to meter the request path, remember to ignore or normalize query parameters, e.g. /some/url?p1=v1&unexpected=param
  2. When you want to meter the request path and your application supports route parameters like /some/{ID}/route, ignore or normalize the {ID} value before setting the path in the metric labels
  3. When you want to meter the request path and bad clients try route discovery against your application, they will probably get many 404 errors. You can handle this by converting all of them to a single value, such as /not-found, before adding them to the metric
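As a rough illustration of all three points, here is a hedged sketch, assuming numeric-ID route segments and an already-known response status (normalizePath and idSegment are hypothetical names, not part of any library):

import (
    "net/http"
    "regexp"
    "strings"
)

// idSegment matches purely numeric path segments such as the 42 in /some/42/route.
var idSegment = regexp.MustCompile(`^[0-9]+$`)

// normalizePath turns a raw request path into a bounded label value.
func normalizePath(r *http.Request, status int) string {
    // Mistake 3: collapse every 404 into a single label value.
    if status == http.StatusNotFound {
        return "/not-found"
    }
    // Mistake 1: r.URL.Path never includes the query string, so labeling
    // with it (instead of the full URL) already drops ?p1=v1&unexpected=param.
    parts := strings.Split(r.URL.Path, "/")
    // Mistake 2: replace variable segments such as numeric IDs with a
    // placeholder, so /some/42/route and /some/43/route share one series.
    for i, p := range parts {
        if idSegment.MatchString(p) {
            parts[i] = "{ID}"
        }
    }
    return strings.Join(parts, "/")
}

With something like this in front of the metric, the handler label only ever sees a fixed set of route patterns such as /some/{ID}/route or /not-found, regardless of what clients send.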

Solution

To solve this problem, you should validate or normalize all of your labels. The issue arises when a label can take many different values, so it's better to control those labels cautiously. For example, you can use enum-like or const values for your labels to restrict what they can hold.
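For example, here is a hedged sketch of restricting a label to a fixed set of constants (the constant names and the safeHandlerLabel helper are hypothetical, chosen only for illustration):

// Hypothetical allow-list: label values are restricted to a small, fixed
// set of constants, and anything else is bucketed into a single value.
const (
    handlerSomeRoute  = "/some/route"
    handlerOtherRoute = "/other/route"
    handlerUnknown    = "other"
)

var knownHandlers = map[string]bool{
    handlerSomeRoute:  true,
    handlerOtherRoute: true,
}

// safeHandlerLabel returns the path only if it is on the allow-list;
// everything else collapses into the single "other" series.
func safeHandlerLabel(path string) string {
    if knownHandlers[path] {
        return path
    }
    return handlerUnknown
}

With an allow-list like this, the handler label can never take more than three distinct values, no matter what clients send.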

TL;DR

This issue can happen even without any attack! It's important to consider it during development, before adding each metric. Check the Common Mistakes section.
