Failure Detection in Agrona Streaming Applications

PatternNovember 2, 2024 - budding Distributed Systems, Failure Detection, Agrona

Agrona's Agent model is a solid choice for building low latency, high throughput streaming services. I've deployed this application model in a number of environments, most prominently in the market data distribution space.

The applications tend to have a simple design:

  • they source data from one or more upstream services
  • they communicate exclusively using Agrona ring buffers internally using efficient binary formats such as Simple Binary Encoding
  • they transform, aggregate and otherwise manipulate the data into a format suitable for downstream consumers
  • they distribute the data to one or more downstream consumers, typically via Aeron

In a well tuned system, the agents are able to process hundreds of thousands of messages per second and have an internal end-to-end latency (measured from the time the data is received from the upstream service to the time it is sent to the downstream consumer) of less than ten microseconds at 99.99% percentile.

Logical System Model

Detecting failures in these applications can be solved using either a publish only heartbeat or a request/response heartbeat.

In this model, the duty cycle of every agent pushes a heartbeat message to a central Failure Detector agent over dedicated Agrona ring buffers. This Failure Detector agent is responsible for observing the heartbeat messages and detecting failures.

In simple models, a missing heartbeat message is taken as an indication of failure. Alternatively, a more advanced model may compute some basic statistics about the heartbeat messages - using a model such as the Phi Accrual Failure Detector.

Ultimately, once the failure detector agent decides it has detected a failure, it sets a flag that is observed by an external probe, and the external monitoring probe can take an action, such as restarting the process.

Why is is this necessary?

In production systems, sometimes agents can become unresponsive. This might be due to a bug in the application, a CPU starvation issue, an extended GC pause, or something else.

In these situations, it is important to detect the failure quickly so that the problem can be diagnosed and fixed.

Simple Publish Only Failure Detection

In this model, the observed agents will push a heartbeat message to the failure detector agent over an Agrona Ring Buffer. These messages are sent at a fixed interval, for example every 100 milliseconds.

The failure detector agent keeps track of the most recent heartbeat messages per agent, and once an agent's heartbeat message is older than the maximum age, the failure detector agent will consider the agent to be down.

Request/Response Failure Detection

In this model, the observed agents and the failure detector agent engage in a request/response interaction. Each side has a dedicated Agrona ring buffer for sending and receiving messages.

The failure detector agent will send a request message to the observed agent at a fixed interval, for example every 100 milliseconds. If the observed agent does not respond to the request within a timeout period, the failure detector agent will consider the agent to be down.

In my experience this model can be a bit more delicate than the publish only model, leading to more frequent false positives.

How to signal failure to the external probe

With latency sensitive applications, it is important that the failure detector agent can signal the external probe as quickly as possible, without introducing unnecessary latency or garbage.

This is simple to do when the process is running on the same machine as the external probe, and your application is using Aeron. In this case, you can add a Failure Detector Counter to the counters managed by the Aeron Media Driver.

An external application (which could be a lightweight C based application) can then poll the counter and react to the presence of a specific value signalling a failure. The AeronStat C based sample application is a good starting point for building this.


Changelog

  • November 2, 2024Initial outline

The colors used in the diagrams in this post are sourced from Pelicans in the water, Galapagos Islands, Ecuador.