
Consul Streaming: What’s behind it?

Pierre Souchay
Criteo R&D Blog · Feb 2, 2021


Criteo has been using HashiCorp's Consul to power massive services composed of hundreds of instances, watched in real time by thousands of clients. This pattern emerged over the years from the need to connect application instances through East-West network traffic.

Running the Marathon of Performance

As the number of applications grew and containers became widespread, the number of instances per service increased considerably, while each instance became smaller and used less memory.

At the same time, since Consul was being integrated into more and more applications, our team was engaged in a marathon to keep up with performance and availability. We were chasing a moving target: Consul's performance had to keep improving while its usage across our infrastructure kept expanding.

The underlying problem we faced during all that time is the following: when a client wants to be notified in real time of changes to the instances of a service, it uses a watch. The client's Consul agent sends a blocking query with an index to a Consul server, asking to be notified if anything changes within a given time window. If something changes before that window expires, the server sends the full result set over the wire to the agent.

Workflow of calls between client, Consul agent and Consul servers during watches
How data flows when performing watches on a Service
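To make this concrete, here is a minimal sketch of such a watch implemented as a blocking query with Consul's official Go API client (the service name and wait time below are illustrative placeholders, not values from our setup):

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (default: 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var lastIndex uint64
	for {
		// Blocking query: the server answers either when the service
		// changes (the index moves forward) or when WaitTime expires.
		entries, meta, err := client.Health().Service("my-service", "", true, &api.QueryOptions{
			WaitIndex: lastIndex,
			WaitTime:  5 * time.Minute,
		})
		if err != nil {
			log.Printf("watch error: %v", err)
			time.Sleep(time.Second) // simple backoff before retrying
			continue
		}
		if meta.LastIndex != lastIndex {
			// Note: the *full* list of instances is re-sent on every change.
			lastIndex = meta.LastIndex
			log.Printf("service changed: %d healthy instances", len(entries))
		}
	}
}
```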

In an infrastructure like Criteo's, a service can be scaled to several hundred instances, and we run thousands of containers. The container orchestration systems (in our case Mesos and Kubernetes) can schedule a few containers every second or so when deploying a new version of an app. Under such conditions, each change results in a heavy CPU and bandwidth price being paid to serialize all the required data to every Consul client watching for changes.

Historically, we have taken some quite creative steps to be able to run Consul on our infrastructure.

Whatever we did, we saw an increase in load each time we added new instances to a large service: for each new instance, bandwidth and CPU grow in proportion to the number of watchers (clients). This load was very significant: with more than 4,000 kinds of services and 260k different instances (and several hundreds of thousands of watchers or clients), our Consul clusters spent much of their time serializing data over the wire anytime something changed in the cluster.

For more than two years, we had been waiting for the release of a new feature, discussed with HashiCorp's engineers, called "streaming".

What is streaming?

Streaming is a new way of sending updates of Consul's distributed database to the Consul agents that compose the cluster. From the client's perspective it works the same, but instead of receiving the full response every time something changes, the client subscribes to changes on the service and the server sends information only about the instance(s) that changed, in real time. From the server's perspective, this means sending "n" changed values to the "m" clients waiting for updates instead of sending n*m values. Furthermore, it uses gRPC instead of msgpack serialization, leading to more efficient bandwidth and CPU usage (Protobuf serialization uses around 30% less CPU for similar workloads according to our tests).

new model with streaming enabled

Sending only diffs is a big change for an infrastructure with a huge number of clients, because the load now grows linearly instead of with the product of instances and clients: from the cluster's perspective, the complexity of updates goes from O(n²) to O(n). For example, a service with 500 instances watched by 1,000 agents previously meant re-serializing 500 entries for each of the 1,000 agents on every change; with streaming, only the changed entry is sent to each agent.

The client simply subscribes and the server sends all updates for a given service; each change notification is very small compared to the full state traditionally sent over the network.

Even better, if several clients are watching the same service on a given agent, the gRPC requests sent to the servers are only sent and received once, so streaming also acts as a cache. Watching the same service once or ten times now has the same cost for the cluster! Why is that important? Because when you run an agent on a container platform as we do, several services are collocated on the same instance. Many of those instances use the same set of services: the same metrics service, the same collectors, the same databases. All those common services use only one channel to get the updates, reducing the load on the servers even more.

The hidden bonus: rate-limiter included!

Historically, on large infrastructures, crafting the SDKs and apps required some caution and care: we spent an enormous amount of time customizing properties, setting up rate-limiters, and defining reasonable retry policies to deal with errors.

Currently, the streaming feature is part of the cache infrastructure of the Consul agents, which means it can be rate-limited simply by adding a cache directive to the agent's configuration, as described here: https://www.consul.io/docs/agent/options#cache.
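As an illustration, such a rate limit can be expressed with the cache stanza of the agent configuration; the values below are arbitrary examples, not recommendations:

```hcl
# Agent configuration (HCL): limits how fast cache entries, and therefore
# the streaming results exposed to local watchers, may be refreshed.
cache = {
  # Example values only: maximum refresh rate for a cache entry (per second).
  entry_fetch_rate = 1
  # Burst size allowed on top of the rate above.
  entry_fetch_max_burst = 2
}
```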

The implementation we provided (#8226) is even hot-reloaded by the agent (#8552), so these parameters can be fine-tuned easily without impacting the clients. This caching-based rate limit was originally developed to limit the impact of large deployments on the servers (see #7863). With streaming, the impact on servers is already low (each update is sent as a diff instead of the whole data set every time), so operators can now use it on the client side to tune how often a client receives updates. This was one of our biggest concerns historically: we had to patch tons of software to handle watches the way we wanted (see Be a good Consul client), and it was sometimes hard to do (e.g., creating pull requests in Prometheus or other systems we could not easily patch or that required additional forks). Having a way to control the frequency of updates seen by clients through the rate-limiter is a great tool to manage scalability better.

Activation of streaming in a Consul 1.9.x cluster

On the server side, you need to enable rpc.enable_streaming on every server (this setting requires a restart). Once every server configuration has been changed, on each client agent where you want the feature, make sure caching is enabled, add use_streaming_backend = true to the agent's configuration, and restart the agent. The cache rate-limit directives, if set, are used by streaming and will rate-limit the changes seen by the client.
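For reference, a minimal configuration sketch (the layout and rate-limit values are illustrative only):

```hcl
# On every Consul server (each server must be restarted after this change):
rpc {
  enable_streaming = true
}

# On each client agent where streaming should be used:
use_streaming_backend = true

# Optional: the cache rate-limit directives also apply to streaming results.
cache = {
  entry_fetch_rate      = 1
  entry_fetch_max_burst = 2
}
```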

Note that this won't work right now on vanilla Consul agents: you need additional patches such as #9555, without which services that don't change much will hit errors after 10 minutes (but HashiCorp's team should come back with fixes very quickly).
For now, if you want to test it today, we recommend using the patches applied in this branch: https://github.com/criteo-forks/consul/tree/1.9.1-criteo.

Performance impact

The performance impact was close to what we expected in theory once our patches were applied:

  • Significant reduction in CPU (less serialization on servers and less deserialization on Consul agents): on agents, when services change a lot, CPU usage drops from 4–5% to less than 1%; on servers, the gap is huge, going from more than 50% of all CPUs on a 64-CPU machine to less than 5%!
  • Significant reduction in bandwidth between agents and servers: around 30% in normal real-life cases, but up to a 1000x reduction in completely pathological cases (huge services with many changes per second).

See more detailed benchmarks at the bottom of the article in the Appendix.

What could be better?

There is no auto-discovery of servers running the gRPC protocol, so you need to be sure all your servers have been restarted before activating the feature on even one agent, otherwise weird errors will happen. In our earlier gRPC patches, we had found a clever way to auto-negotiate the protocol by exposing the gRPC capability in Serf (the auto-discovery protocol used by Consul).

The feature is currently limited to the /v1/health/service/:serviceName endpoint, so it works only for discovery health queries (but that is the most used endpoint in our clusters, and it should eventually be extended to other endpoints in future Consul releases).

Node proximity is not applied when using the ?near parameter in health queries, so the order of results is purely lexical rather than computed from node proximity (we do not use this feature, so it did not matter to us, but it might be a problem if you rely on it).

The cache duration for streaming in the Consul agent is set to 10 minutes, which exactly matches the maximum duration of watches. I suspect this value produces unnecessary cache cleanups and reloads for services that rarely change: a watcher blocks on an unchanged service for 10 minutes, then does some quick processing and restarts its poll. In the meantime, the cache entry is cleaned up and streaming is stopped on the agent, so everything has to start again when the client's next watch hits the agent. A larger TTL might fix the issue.

Is it ready yet?

The support is still in beta, as it represents a huge change in Consul's code: the way data is sent has been modified, along with large modifications to the caching code.

In order to deploy 1.9.1 in our infrastructure, we also had to apply some patches:

  • In the Prometheus support (to avoid our logs being filled): #9510: merged in 1.9.2
  • TLS negotiation for the gRPC protocol (depending on your configuration, you might not be impacted): #9512: merged in 1.9.2
  • Some warnings when the configuration is wrong (not strictly needed, but it helps operators): #9530: merged in 1.9.3
  • Finally, and more importantly, a fix for the bug affecting services that are not modified within 10 minutes: #9555: not merged yet.
  • Still a problem with filtering on tags for both DNS queries and /v1/health/service/:serviceName, as described in #9695 and #9702. A fix is available in #9703 but not yet merged.

As usual, we have our own fork with those patches included in its 1.9.x branches.

Conclusion

While still young, this feature is a game-changer for large and/or very dynamic environments (such as environments with containers being quickly added and removed), as it ensures Consul will be able to handle load more easily. For users of Consul Connect, this feature will eventually bring significant benefits in ever more dynamic environments. We cannot wait for streaming to become the default setting once it is fully polished!

We are also very proud of our work, collaboration, and contributions with Hashicorp and we hope other people will benefit from it quickly.

Thanks to:

  • Connal Murphy for his help fixing my poor English skills
  • Paul-Hadrien Bourquin for the benchmarks to write this article
  • Mathilde Gilles for her work on GRPC — giving us enough time to wait for streaming to happen
  • Nicolas Benoit, Paul-Hadrien Bourquin, Yohan Chuzeville, and Mathilde Gilles for their work within the Discovery Team
  • and many others that helped us

Appendix: Benchmark results

To measure the impact, we did the test on:

  • some Terraform clusters with fake services
  • one of our pre-production clusters
  • our production cluster

We want to benchmark the impact of streaming in a pathological case that occurs when services belonging to some large pools are redeployed, triggering a significant number of events in a short span of time.

We expect streaming to significantly reduce the average traffic and CPU consumption between the servers and the agents watching this service. Let's see!

Scenario:

  1. Disable rate limit on cache updates: we want to be aware of all events without being rate limited.
  2. Register a dummy service with 5000 instances.
  3. On a Consul server, loop every second to update the metadata of one instance of this service. This triggers tons of notifications on watching agents (a minimal sketch of steps 2 and 3 follows this list).
  4. Start watching the service from 2 agents, one with streaming and one without.
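Below is a minimal sketch of steps 2 and 3 using Consul's Go API client; node names, addresses, ports and the service name are invented for illustration and do not reflect our actual tooling:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	catalog := client.Catalog()

	// Step 2: register 5000 instances of a dummy service on synthetic nodes.
	for i := 0; i < 5000; i++ {
		reg := &api.CatalogRegistration{
			Node:    fmt.Sprintf("fake-node-%d", i),
			Address: fmt.Sprintf("10.0.%d.%d", i/250, i%250),
			Service: &api.AgentService{
				ID:      fmt.Sprintf("dummy-%d", i),
				Service: "dummy",
				Port:    8080,
			},
		}
		if _, err := catalog.Register(reg, nil); err != nil {
			log.Fatal(err)
		}
	}

	// Step 3: every second, change the metadata of a single instance.
	// Each write bumps the service index and notifies every watcher.
	for tick := 0; ; tick++ {
		reg := &api.CatalogRegistration{
			Node:    "fake-node-0",
			Address: "10.0.0.0",
			Service: &api.AgentService{
				ID:      "dummy-0",
				Service: "dummy",
				Port:    8080,
				Meta:    map[string]string{"tick": fmt.Sprintf("%d", tick)},
			},
		}
		if _, err := catalog.Register(reg, nil); err != nil {
			log.Printf("update error: %v", err)
		}
		time.Sleep(time.Second)
	}
}
```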

Each benchmark ran for 15 minutes. Here are the results.

Terraform benchmarks:

In this scenario, we are 'in isolation': our cluster only has the fake service, without external noise from other services living their best lives.

Streaming results tests on Terraform

We can observe a drastic drop in bandwidth between the two configurations, with a difference of approximately 2 GB in cumulative received traffic.

Pre-production benchmarks:

Here we have lots of other services registered in the cluster. This explains the much larger traffic overall for both tests.

Interface Usage Graph during the test: Left with streaming, Right without Streaming activated
Detailed bandwidth between agents and servers: Left with Streaming, Right without

Once again, we can observe a ~2 GB difference between the two configurations, with streaming being the clear winner: it uses a third of the bandwidth!

CPU Usage

CPU comparison: on the left with streaming, on the right without

Here the CPU usage does not go very high, since these are 64-CPU machines and we would have to push the numbers much further to have a bigger impact on them. But we can definitely see a higher CPU average without streaming, and by a huge margin: from 2–4% down to around 0.2–0.4%, so the CPU used by the Consul agent is divided by almost 10!

Production: In a non-pathological scenario

Here we simply watch all services on our cluster using https://github.com/criteo/consul-templaterb, without forcing updates on a given service, so it reflects real data from Criteo's production usage.

Protocol

  • 4 instances of Consul-UI watching 2 consul-relay services in production for 12 hours
  • 7 consul servers, 4600+ Consul agents
Comparison of bandwidth usage in production with streaming (top) and without (bottom) during 12h

Conclusion

As theory predicted, streaming performs much better than the previous approach, and the servers handle the load much better.

Thanks for reading! If you want to join our projects, check out our open positions.
