Srinivas Devaki

sre @ zomato

Containing Cardinality Explosion

At Zomato we initially used statsd exporter for Envoy metrics. A couple of years ago, when we needed to report application metrics as well, we added a core module to our monolithic PHP codebase that sends stats to the same statsd exporter sidecar we were already running for Envoy.

While transitioning to microservices three years ago, we had to choose how our Go services would report metrics. We could keep using statsd exporter, or, since Go (unlike PHP) can keep state across requests, we could use the Prometheus Go SDK and let Prometheus pull metrics directly from the application. There is very little difference between the two, but we went ahead with statsd exporter because we already used it for Envoy metrics and because it helps the reliability of the main application container for a couple of reasons:

  1. The application container doesn't have to take a mutex for each metric update, which we thought could add lock contention in rare cases.
  2. With the SDK, a cardinality explosion can crash the main application container with an OOM, which impacts the end user. A statsd container crashing from a cardinality explosion is a soft issue with no impact on the end user.

But reliability is critical to Zomato, and we worry about losing any metrics. With cases of cardinality explosion becoming more frequent, we revisited the problem this week so that human mistakes no longer lead to loss of metrics.

A good read for more context on the topic: https://www.robustperception.io/cardinality-is-key

One sad reality is that supporting high cardinality in time series is expensive. Some databases, like TimescaleDB, support high cardinality with as small a resource footprint as possible, but others, like Prometheus, are not even functional at high cardinality.

Human mistakes in designing metrics often lead to cardinality explosion and cause issues in various components of the Prometheus stack: statsd exporter crashes, Prometheus resource exhaustion, and in rare cases even impact on the main application from network overhead, if Prometheus pulls tens of MBs every 10 seconds from each statsd exporter.
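To make the scale concrete: the worst-case number of time series for one metric name is the product of the number of distinct values of each label. The sketch below only does that arithmetic. The label names and counts are made up for illustration, and `worstCaseSeries` is a hypothetical helper, not part of any library.

```go
package main

import "fmt"

// worstCaseSeries returns an upper bound on the number of time series for a
// single metric name: the product of the distinct value counts of its labels.
func worstCaseSeries(distinctValuesPerLabel map[string]uint64) uint64 {
	total := uint64(1)
	for _, n := range distinctValuesPerLabel {
		total *= n
	}
	return total
}

func main() {
	// Hypothetical label counts, chosen only for illustration.
	bounded := map[string]uint64{"country_id": 195, "status_code": 5}
	unbounded := map[string]uint64{"country_id": 195, "status_code": 5, "user_id": 1_000_000}

	fmt.Println(worstCaseSeries(bounded))   // 975
	fmt.Println(worstCaseSeries(unbounded)) // 975000000
}
```

A single unbounded label such as a user ID is enough to turn a harmless metric into hundreds of millions of potential series.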

So we now know that high cardinality is bad for Prometheus, but what can we do about it, and how can we safeguard against human mistakes that cause cardinality explosion? Since this is an anti-pattern, the earlier in the feature development cycle we detect it the better, which means we can start by making the code architecture itself prohibit or restrict high cardinality.

To start out, we can define what a simple counter metric looks like:

type CounterI interface {
	Add(n uint64)
	Inc()
}

type ClassicCounter struct {
	Name string
	Labels map[string]string
}

func (c *ClassicCounter) Add(n uint64) {
	c.sendToStatsD(n)
}

func (c *ClassicCounter) Inc() {
	c.Add(1)
}

// sendToStatsD reports an increment of n for this counter to the statsd exporter sidecar.
func (c *ClassicCounter) sendToStatsD(n uint64) {
	// todo
}
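One way the sendToStatsD stub might be filled in, as a minimal sketch rather than our actual implementation: it assumes the sidecar accepts DogStatsD-style tags (`name:value|c|#key:value,...`) over UDP. `formatCounter`, `send`, the metric name, and the address `127.0.0.1:9125` are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"net"
	"sort"
	"strings"
)

// formatCounter renders a counter increment as a DogStatsD-style line:
// <name>:<value>|c|#<key>:<value>,...
// Label keys are sorted so the same label set always yields the same line.
func formatCounter(name string, labels map[string]string, n uint64) string {
	line := fmt.Sprintf("%s:%d|c", name, n)
	if len(labels) == 0 {
		return line
	}
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	tags := make([]string, 0, len(keys))
	for _, k := range keys {
		tags = append(tags, k+":"+labels[k])
	}
	return line + "|#" + strings.Join(tags, ",")
}

// send writes one line to the sidecar over UDP (fire and forget).
func send(addr, line string) error {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(line))
	return err
}

func main() {
	labels := map[string]string{"status": "ok", "country_id": "91"}
	line := formatCounter("orders_placed", labels, 1)
	fmt.Println(line) // orders_placed:1|c|#country_id:91,status:ok

	// Assumed sidecar address; the error is ignored because metrics are best effort.
	_ = send("127.0.0.1:9125", line)
}
```

A real client would reuse one connection instead of dialing per metric. The point here is only that the application side stays a thin, lock-free formatter.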

If we could catch cardinality explosion before compilation, we would never hit it in production at all. To do that, we can write code so that metrics and their labels are declared only in the initialisation phase (i.e. as package-scope variables) and never at runtime. Since Go doesn't provide a built-in mechanism to distinguish these phases (or I was unable to find one on the internet), we have to rely on static analysis to catch violations ahead of compilation. It should be quite easy to verify that ClassicCounter objects are only constructed in package-scope variables. Since cardinality is determined by the number of unique (name, labels) combinations, a ClassicCounter must be constructed with its name and labels up front, and neither should change afterwards. To simplify the job, we can make ClassicCounter private and expose a NewStaticCounter constructor.

func NewStaticCounter(name string, labels map[string]string) CounterI {
    return &classicCounter{
        Name: name,
        Labels: labels,
    }
}

With this, the static analysis job gets simpler as well: it only has to enforce that NewStaticCounter is called in package-scope variables and nowhere else. Since constructing a counter requires the name and labels, and nothing allows modifying them afterwards, the problem of cardinality explosion is resolved.
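Such a check could be sketched with Go's standard go/parser and go/ast packages. This is a purely syntactic illustration, not a production analyser. `findRuntimeConstructions` and `calleeName` are hypothetical helpers, and the sketch only looks inside top-level function bodies, so a function literal assigned to a package-scope variable would slip through.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// findRuntimeConstructions parses src and returns the position of every call
// to a function named ctor inside a top-level function body. Calls in
// package-scope variable declarations are not reported.
func findRuntimeConstructions(src, ctor string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}
	var found []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			if call, ok := n.(*ast.CallExpr); ok && calleeName(call.Fun) == ctor {
				found = append(found, fset.Position(call.Pos()).String())
			}
			return true
		})
	}
	return found, nil
}

// calleeName handles both NewStaticCounter(...) and pkg.NewStaticCounter(...).
func calleeName(e ast.Expr) string {
	switch f := e.(type) {
	case *ast.Ident:
		return f.Name
	case *ast.SelectorExpr:
		return f.Sel.Name
	}
	return ""
}

const sample = `package demo

var ordersPlaced = NewStaticCounter("orders_placed", nil)

func handle() {
	c := NewStaticCounter("per_request", nil)
	c.Inc()
}
`

func main() {
	violations, err := findRuntimeConstructions(sample, "NewStaticCounter")
	if err != nil {
		panic(err)
	}
	// Only the call inside handle is reported; the package-scope one is allowed.
	for _, pos := range violations {
		fmt.Println("NewStaticCounter called at runtime:", pos)
	}
}
```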

Not so fast!!

What if one of the labels has dynamic values, but the set of values is finite and small enough to never cause a cardinality explosion? country_id is one example. With the approach above, we would have to construct counters ahead of time in package scope for all 195 countries in the world, which would really bloat the codebase. So we can't solve every case by enforcing that NewStaticCounter is only called in package scope.

Now that we know it isn't possible to prevent this before compilation without bloating the codebase, we have to find a way to stop cardinality explosion from impacting production. An easy way is to decide on a sensible cardinality threshold per metric name up front and drop every new label combination beyond that threshold. To keep a bad metric from affecting a good one, we can also isolate the thresholds per metric object.

type DynamicLabeledCounter interface {
    WithLabels(labels map[string]string) CounterI
}

func NewDynamicLabeledCounter(name string, thresh uint32) DynamicLabeledCounter {
    return &dynamicLabeledCounterImpl{
        name: name,
        bag: NewBagWithMaxItems(thresh),
    }
}

type dynamicLabeledCounterImpl struct {
    name string
    bag BagOfCounterI
}

func (c *dynamicLabeledCounterImpl) WithLabels(labels map[string]string) CounterI {
    // GetOrCreate, gets a CounterI if already exists in the bag else creates it, if the bag is full then return a CounterI which sends metrics to /dev/null
    return c.bag.GetOrCreate(c.name, labels)
}
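The bag itself is left undefined above. A minimal sketch of what NewBagWithMaxItems and GetOrCreate could look like follows, under some assumptions: `bag`, `memCounter`, `noopCounter`, and `seriesKey` are hypothetical names, and the in-memory counter stands in for one that would actually send to statsd.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

type CounterI interface {
	Add(n uint64)
	Inc()
}

// noopCounter discards everything; it is handed out once the bag is full.
type noopCounter struct{}

func (noopCounter) Add(uint64) {}
func (noopCounter) Inc()       {}

// memCounter is a stand-in for a counter that would send to statsd.
type memCounter struct {
	mu    sync.Mutex
	total uint64
}

func (c *memCounter) Add(n uint64) { c.mu.Lock(); c.total += n; c.mu.Unlock() }
func (c *memCounter) Inc()         { c.Add(1) }

// bag holds at most max distinct (name, labels) series.
type bag struct {
	mu    sync.Mutex
	max   int
	items map[string]CounterI
}

func NewBagWithMaxItems(max uint32) *bag {
	return &bag{max: int(max), items: make(map[string]CounterI)}
}

// GetOrCreate returns the existing counter for this series, creates one if
// there is room, and otherwise returns a counter that drops all updates.
func (b *bag) GetOrCreate(name string, labels map[string]string) CounterI {
	key := seriesKey(name, labels)
	b.mu.Lock()
	defer b.mu.Unlock()
	if c, ok := b.items[key]; ok {
		return c
	}
	if len(b.items) >= b.max {
		return noopCounter{}
	}
	c := &memCounter{}
	b.items[key] = c
	return c
}

// seriesKey builds a stable key by sorting label names.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var sb strings.Builder
	sb.WriteString(name)
	for _, k := range keys {
		sb.WriteString("|" + k + "=" + labels[k])
	}
	return sb.String()
}

func main() {
	b := NewBagWithMaxItems(2)
	for _, id := range []string{"1", "2", "3"} {
		c := b.GetOrCreate("orders", map[string]string{"country_id": id})
		_, dropped := c.(noopCounter)
		fmt.Printf("country_id=%s dropped=%v\n", id, dropped)
	}
}
```

With a threshold of 2, the third distinct country_id gets the no-op counter, while the first two series keep reporting.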

Even though we can implement a different cardinality threshold per metric name, it is still possible for a large number of metrics to be registered in statsd. To avoid this, we can enforce a hard check that NewDynamicLabeledCounter is only called in package scope, while WithLabels, used at runtime, supports dynamic label values that are finite and few. This way we reduce the blast radius of a bad metric without adding much code complexity.

But now we have two kinds of constructors for a counter, one of which asks for a threshold, plus a static analyser rule that they may only be called in package scope. Even though this approach solves the problem, it adds significant cognitive overhead to working with a simple counter. We can trade away this complexity by applying the same principles in the statsd exporter, i.e. moving all the logic there. Not all of the functionality can move, though. First, we can no longer customise the cardinality threshold per metric and have to rely on a common threshold for all metrics. Second, cardinality explosion through metric names becomes possible again.

99.99% of use cases don't require a per-metric cardinality threshold, so the first gap can be ignored. The latter, cardinality explosion through a large number of unique metric names, is a reliability concern: one bad metric can cause reporting issues for good metrics. Coming back to the idea of static analysis, we can enforce that all metric names are string constants. This is how names are usually written anyway, so it isn't something a developer has to consciously remember. If a metric name contains any kind of variable, the static analyser can recommend using labels for the dynamic parts.
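A rough sketch of such a rule, again with the standard go/parser and go/ast packages: it flags any ClassicCounter literal whose Name field is not a plain string literal. `findDynamicNames` is a hypothetical helper. Because the check is purely syntactic, it would also flag a named constant; a type-aware version built on go/types could accept those.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// findDynamicNames returns the position of every ClassicCounter{Name: ...}
// literal in src whose Name value is anything other than a string literal.
func findDynamicNames(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}
	var found []string
	ast.Inspect(file, func(n ast.Node) bool {
		lit, ok := n.(*ast.CompositeLit)
		if !ok || typeName(lit.Type) != "ClassicCounter" {
			return true
		}
		for _, elt := range lit.Elts {
			kv, ok := elt.(*ast.KeyValueExpr)
			if !ok {
				continue
			}
			key, ok := kv.Key.(*ast.Ident)
			if !ok || key.Name != "Name" {
				continue
			}
			if !isStringLiteral(kv.Value) {
				found = append(found, fset.Position(kv.Value.Pos()).String())
			}
		}
		return true
	})
	return found, nil
}

// typeName handles both ClassicCounter and pkg.ClassicCounter.
func typeName(e ast.Expr) string {
	switch t := e.(type) {
	case *ast.Ident:
		return t.Name
	case *ast.SelectorExpr:
		return t.Sel.Name
	}
	return ""
}

func isStringLiteral(e ast.Expr) bool {
	lit, ok := e.(*ast.BasicLit)
	return ok && lit.Kind == token.STRING
}

const sample = `package demo

var ordersPlaced = &ClassicCounter{Name: "orders_placed"}

func handle(city string) {
	c := &ClassicCounter{Name: "orders_placed_" + city}
	c.Inc()
}
`

func main() {
	violations, err := findDynamicNames(sample)
	if err != nil {
		panic(err)
	}
	// Only the concatenated name inside handle is reported.
	for _, pos := range violations {
		fmt.Println("metric name is not a string literal, use a label instead:", pos)
	}
}
```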

So we can just continue to use the ClassicCounter object itself without any changes while ensuring that all metrics are automatically protected from cardinality explosion.

So the final approach: keep all the cardinality explosion safeguards in the statsd exporter, and maintain a simple static analysis check that guards against the one rare edge case that can't be handled outside the application codebase.

PS:

  1. All examples are pseudocode.
  2. Customised thresholds per metric name can also be built outside the application codebase by extending statsd exporter to support internal labels that are used for tracking thresholds and are stripped before reporting to Prometheus. But a common threshold works for most use cases.