Aeron Process Observability

Pattern - November 6, 2024 - budding - Distributed Systems, Aeron, Observability

The state of Aeron-based applications can be observed from a dedicated external process, without adding latency to the application itself.

Low latency Aeron process observability

Aeron captures a large amount of data about the state of Aeron and the Media Driver, and stores it as counters in the memory-mapped cnc.dat file. This data can be used to observe the state of an Aeron-based application, including the number of messages sent and received, the number of errors, and the back pressure event count.

Applications that use Aeron can also add their own custom counters to the cnc.dat file. These counters can track the state of the application itself, such as the number of business messages processed, the number of application errors, or application liveness. Counters stored within the cnc.dat file do not need to be directly related to Aeron.

Constraints and considerations around using this pattern

For this to work, the application process and the observability process must run on the same machine (or, if you're running Kubernetes, in the same pod on the same cluster node). This is a requirement because Aeron uses shared memory to store the data within the cnc.dat file.

If your server is at or near capacity, you may not have enough CPU cycles left to run the observability process without adding latency to your application. While reading the data from the cnc.dat file is relatively cheap, the observability process will need to parse the data and convert it into a format suitable for the observability system, and provide a REST or other API to access the data.
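One way to keep that cost bounded is to render the counters once per polling interval and have the API hand out the cached bytes, so a burst of scrapes does not trigger repeated parsing. The sketch below uses only the JDK's built-in HTTP server. The class name MetricsEndpoint, the /metrics path, and the refresh/start methods are assumptions made for illustration, not part of Aeron.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical helper: serves a pre-rendered snapshot of the counters.
final class MetricsEndpoint
{
    // Latest rendered snapshot; replaced atomically by the polling thread.
    private final AtomicReference<byte[]> cached = new AtomicReference<>(new byte[0]);

    // Called by the polling thread at a modest cadence (for example once a second).
    // The renderer is where the counters would be read and formatted.
    void refresh(final Supplier<String> renderer)
    {
        cached.set(renderer.get().getBytes(StandardCharsets.UTF_8));
    }

    byte[] current()
    {
        return cached.get();
    }

    // Starts an HTTP server whose handler only copies the cached bytes.
    HttpServer start(final int port) throws IOException
    {
        final HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/metrics", exchange ->
        {
            final byte[] body = current();
            exchange.getResponseHeaders().add("Content-Type", "text/plain; charset=utf-8");
            // A length of -1 tells the JDK server that no response body follows.
            exchange.sendResponseHeaders(200, body.length == 0 ? -1 : body.length);
            if (body.length > 0)
            {
                try (OutputStream out = exchange.getResponseBody())
                {
                    out.write(body);
                }
            }
            exchange.close();
        });
        server.start();
        return server;
    }
}
```

With this split, the request handler does no parsing at all, and the polling cadence becomes the single knob that controls how much CPU the observability process consumes.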

Counter space is a finite resource. By default, Aeron allocates 8192 slots for counters in the cnc.dat file. If your application adds counters and there is not enough counter space left, Aeron will not be able to add the new counters. Additionally, if your application does not correctly manage the counter space it allocates, you may experience paging and other memory-related issues.
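The 8192 figure can be reproduced with simple arithmetic. The sketch below assumes a 1 MiB counter values buffer and a 128-byte slot per counter value (two 64-byte cache lines); both numbers are assumptions to verify against the Aeron and Agrona versions you run. The class and method names are hypothetical.

```java
// Hypothetical helper for reasoning about counter capacity.
final class CounterCapacity
{
    // Assumption: each counter value occupies two 64-byte cache lines.
    static final int ASSUMED_COUNTER_SLOT_BYTES = 128;

    // Assumption: the default counter values buffer is 1 MiB.
    static final int ASSUMED_VALUES_BUFFER_BYTES = 1024 * 1024;

    // Number of counter slots that fit in a values buffer of the given size.
    static int maxCounters(final int valuesBufferBytes)
    {
        return valuesBufferBytes / ASSUMED_COUNTER_SLOT_BYTES;
    }

    // Slots still free after a number of counters has been allocated (never negative).
    static int remainingSlots(final int valuesBufferBytes, final int allocatedCounters)
    {
        return Math.max(0, maxCounters(valuesBufferBytes) - allocatedCounters);
    }
}
```

Under these assumptions, CounterCapacity.maxCounters(CounterCapacity.ASSUMED_VALUES_BUFFER_BYTES) evaluates to 8192. An observability process could compare a figure like this with the number of allocated counters it sees and raise an alert before the space runs out.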

Note that Aeron supports both ephemeral and persistent counter space. Ephemeral counters are tied to the lifecycle of the Aeron client that created them. Once the Aeron client is closed, the counters are released within a few seconds. Persistent counters are not tied to the lifecycle of a client, and will persist until the cnc.dat file is deleted.

Reading counter data

The counter data is read from the cnc.dat file. The Aeron client exposes a CountersReader through aeron.countersReader(), which can be used to read the counter data.

A simple way to read this data is to use the CounterConsumer interface:

@FunctionalInterface
public interface CounterConsumer
{
  /**
   * Accept the value for a counter.
   *
   * @param value     of the counter.
   * @param counterId of the counter
   * @param label     for the counter.
   */
  void accept(long value, int counterId, String label);
}

This can be used in the observability process to read the counter data from the cnc.dat file once the Aeron client has connected to the Media Driver.

...
  final Aeron aeron = Aeron.connect(aeronCtx);
  // where this::printCounter is an instance method that 
  // matches the CounterConsumer interface
  aeron.countersReader().forEach(this::printCounter);
...

From the printCounter method, you can then publish the counter data in whatever format is compatible with your observability system.
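As an illustration, here is a minimal sketch of a printCounter implementation that collects each counter into a Prometheus-style text line. The class name CounterSnapshot, the metric name aeron_counter, and the choice of output format are assumptions for this example; substitute whatever your observability system expects.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical collector: gathers counters into Prometheus-style text lines.
final class CounterSnapshot
{
    // Keyed by counter id so the rendered output has a stable order.
    private final Map<Integer, String> linesById = new TreeMap<>();

    // Same parameter list as CounterConsumer.accept, so it can be passed
    // as a method reference: snapshot::printCounter.
    void printCounter(final long value, final int counterId, final String label)
    {
        // Escape backslash, double quote and newline inside the label value.
        final String safeLabel = label
            .replace("\\", "\\\\")
            .replace("\"", "\\\"")
            .replace("\n", "\\n");

        linesById.put(
            counterId,
            "aeron_counter{id=\"" + counterId + "\",label=\"" + safeLabel + "\"} " + value);
    }

    // Renders one line per counter, or an empty string if nothing was collected.
    String render()
    {
        return linesById.isEmpty() ? "" : String.join("\n", linesById.values()) + "\n";
    }
}
```

In the observability process this would be wired up as aeron.countersReader().forEach(snapshot::printCounter), followed by snapshot.render() to obtain the text to publish.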


Changelog

  • November 6, 2024: Initial outline

The colors used in the diagrams in this post are sourced from The Narrows, Zion National Park, Utah, USA.