Consider the following scenario: a data source connector is connected to an external source that produces a large volume of data. The connector converts the data format and sends it on to a router. The router in turn forwards the data to a recipient, which writes it to a database cluster.
What happens when the database is unable to write messages fast enough? Somewhere, something will break. At some point, the router will either slow down significantly or no longer be able to send messages on to the recipient process. Once the router starts slowing down or failing to send, what happens to the data source connector? As the failure spreads backwards from the database towards the data source connector, the wider system is likely to become unstable.
This problem often occurs when producers send data faster than the network, an intermediate process (for example, a message bus) or the destination process can accept it. To solve it, we need to apply backpressure so that the upstream process understands there is a problem.
Common definitions
- Wikipedia defines backpressure as "resistance or force opposing the desired flow of fluid through pipes"
- The Reactive Manifesto defines it as "When one component is struggling to keep-up, the system as a whole needs to respond in a sensible way. It is unacceptable for the component under stress to fail catastrophically or to drop messages in an uncontrolled fashion. Since it can’t cope and it can’t fail it should communicate the fact that it is under stress to upstream components and so get them to reduce the load"
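In code, the simplest form of this upstream signal is a bounded buffer whose rejected writes tell the publisher that it must slow down, drop or divert. A minimal Java sketch (class name and buffer size are illustrative, not from any particular library):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch: a bounded buffer between producer and consumer.
// When the buffer is full, offer() returns false, which the producer
// can treat as a backpressure signal (slow down, drop, or buffer elsewhere).
public class BoundedBuffer {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

    /** Producer side: returns false when the consumer is not keeping up. */
    public boolean tryPublish(String message) {
        return queue.offer(message);
    }

    /** Consumer side: blocks until a message is available. */
    public String take() throws InterruptedException {
        return queue.take();
    }
}
```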
Exerting backpressure
There are different techniques to signal backpressure to a publishing application:
- with most messaging systems, producer backpressure is built in and dealt with via the APIs.
- with other systems - for example an HTTP, FIX or WebSocket server - the easiest way to apply backpressure to producers is often simply to shed load (see the sketch below). This could be done selectively for specific clients, or for the process as a whole.
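As a sketch of load shedding over HTTP, the following uses the JDK's built-in com.sun.net.httpserver to reject requests with a 503 once an in-flight limit is reached; the limit and endpoint are illustrative, and a production server would also set Retry-After and meter per client:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.Semaphore;

// Sketch of load shedding at an HTTP ingress: when the in-flight limit
// is reached, new requests are rejected with 503 instead of being queued.
public class SheddingServer {
    private static final Semaphore inFlight = new Semaphore(100); // illustrative capacity

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/ingest", exchange -> {
            if (!inFlight.tryAcquire()) {
                // Shed load: tell the client we are overloaded.
                exchange.sendResponseHeaders(503, -1);
                exchange.close();
                return;
            }
            try {
                byte[] body = "accepted".getBytes();
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            } finally {
                inFlight.release();
                exchange.close();
            }
        });
        server.start();
    }
}
```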
Handling backpressure
There are several tactics that can be applied by producers:
- Drop all or certain classes of messages in the publisher. This is how Logback's AsyncAppender operates: once its buffer reaches 80% capacity, events of level TRACE, DEBUG and INFO start getting dropped (see the first sketch after this list).
- Increase publisher buffer sizes. This helps in scenarios where the increase in data send rates is only temporary.
- Slow down the publisher - this is simple if you control the inputs, but can sometimes be very hard to do - for example, if you have a public service.
- Reduce data size. Smaller data means buffers will take longer to fill up. Again, if the database cannot deal with the data volumes long term then this will not help much.
- Use physical disk buffering (for example, Aeron Archive) on the publisher to allow the publisher to write freely while consumer(s) consume as fast as they can. This only works while you have sufficient disk space for the buffering.
- Reduce publisher data volume. Consider conflating the data within the data source connector, so that the output data is a subset of the input data (see the conflation sketch after this list).
- Investigate protocol improvements. For example, instead of having the publisher push data endlessly, have it push data in batches at the request of consumers.
- Disconnect the slow consumer.
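The first sketch below illustrates dropping by message class, in the spirit of Logback's AsyncAppender (the class is hypothetical, not Logback's implementation): once remaining buffer capacity falls below a threshold, low-priority messages are discarded first, keeping space for the important ones.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of class-based dropping: low-priority messages are discarded
// once the buffer gets close to full; high-priority messages are only
// dropped when the buffer is completely full.
public class DiscardingPublisher {
    enum Priority { LOW, HIGH }

    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(1000);
    private final int discardThreshold = 200; // drop LOW when fewer than 200 slots remain

    public boolean publish(String message, Priority priority) {
        if (priority == Priority.LOW && buffer.remainingCapacity() < discardThreshold) {
            return false; // dropped
        }
        return buffer.offer(message);
    }
}
```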
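And a sketch of conflation as mentioned above (the class is illustrative): only the latest value per key is kept, so a slow consumer receives fewer, fresher updates rather than every intermediate one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of conflation: pending updates are keyed, and a newer update
// for the same key overwrites the older one that has not yet been consumed.
public class Conflater<K, V> {
    private final Map<K, V> latest = new LinkedHashMap<>();

    public synchronized void update(K key, V value) {
        latest.put(key, value); // overwrite any pending value for this key
    }

    /** Consumer drains whatever is pending, at its own pace. */
    public synchronized Map<K, V> drain() {
        Map<K, V> snapshot = new LinkedHashMap<>(latest);
        latest.clear();
        return snapshot;
    }
}
```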