Calyptia Core Agent
23.10

Buffering & Storage

The end goal of Calyptia Core Agent is to collect, parse, filter, and ship logs to a central place. This workflow has many phases, and one of the critical pieces is buffering: a mechanism to place processed data into a temporary location until it is ready to be shipped.
By default, when Calyptia Core Agent processes data, it uses memory as the primary, temporary place to store records. In certain scenarios, however, a persistent buffering mechanism based on the filesystem is preferable because it provides aggregation and data safety capabilities.
Choosing the right configuration is critical, and the behavior of the service depends on the backpressure settings. Before jumping into the configuration, let's make sure we understand the relationship between Chunks, Memory, Filesystem, and Backpressure.

Chunks, memory, filesystem, and backpressure

Understanding the chunks, buffering, and backpressure concepts is critical for a proper configuration. Let's do a recap of the meaning of these concepts.

Chunks

When an input plugin (source) emits records, the engine groups the records together in a Chunk. A Chunk is usually around 2 MB in size. Based on the configuration, the engine decides where to place this Chunk; by default, all Chunks are created only in memory.

Irrecoverable chunks

There are two scenarios where Calyptia Core Agent marks chunks as irrecoverable:
  • When Calyptia Core Agent encounters a bad layout in a chunk. A bad layout is a chunk that does not conform to the expected format (see the chunk definition).
  • When Calyptia Core Agent encounters an incorrect or invalid chunk header size.
In both scenarios, Calyptia Core Agent will log an error message and then discard the irrecoverable chunks.

Buffering and memory

As previously mentioned, the Chunks generated by the engine are placed in memory, but this is configurable.
If memory is the only mechanism set for the input plugin, it will store as much data as it can in memory. This is the fastest mechanism with the least system overhead, but if the service cannot deliver records fast enough because of a slow network or an unresponsive remote service, Calyptia Core Agent memory usage will increase because it accumulates more data than it can deliver.
In a high-load environment with backpressure, the risk of high memory usage is that the service gets killed by the kernel (OOM Killer). A workaround for this backpressure scenario is to limit the amount of memory that an input plugin can use for records; this configuration property is called mem_buf_limit. If a plugin has enqueued more than mem_buf_limit, it cannot ingest more data until the enqueued data is delivered or flushed properly. In this scenario the input plugin in question is paused.
When the input is paused, records will not be ingested until it is resumed. For some inputs, such as TCP and tail, pausing the input will almost certainly lead to log loss. For the tail input, Calyptia Core Agent can save its current offset in the file it is reading and pick back up when the input is resumed.
Look for messages in the Calyptia Core Agent log output like:
[input] tail.1 paused (mem buf overlimit)
[input] tail.1 resume (mem buf overlimit)
The mem_buf_limit workaround is good for certain scenarios and environments because it helps control the memory usage of the service. The cost is that if a file gets rotated while the input is paused, you might lose that data, since the input cannot register new records. This can happen with any input source plugin. The goal of mem_buf_limit is memory control and survival of the service.
For a full data safety guarantee, use filesystem buffering.
Here is an example input definition:
[INPUT]
    Name          tcp
    Listen        0.0.0.0
    Port          5170
    Format        none
    Tag           tcp-logs
    Mem_Buf_Limit 50MB
If this input uses more than 50 MB of memory to buffer logs, you will get a warning like this in the Calyptia Core Agent logs:
[input] tcp.1 paused (mem buf overlimit)
Mem_Buf_Limit applies only when storage.type is set to the default value of memory. The following section explains the limits that apply when you enable storage.type filesystem.
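As a sketch of the same memory limit applied to a tail input, the following definition also sets the tail plugin's DB option so that the read offset is tracked in a file. The Path, Tag, and DB values, as well as the 10MB limit, are hypothetical placeholders rather than values from this guide; adjust them to your environment.

```text
[INPUT]
    Name          tail
    # Hypothetical log location
    Path          /var/log/app/*.log
    Tag           app-logs
    # Hypothetical database file used to track file offsets
    DB            /var/lib/calyptia/tail-app.db
    # Pause this input once it holds about 10 MB of undelivered records in memory
    Mem_Buf_Limit 10MB
```

Because storage.type is not set, this input uses the default memory buffering, so Mem_Buf_Limit applies to it.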

Filesystem buffering to the rescue

Enabling filesystem buffering helps with backpressure and overall memory control.
Behind the scenes, the Memory and Filesystem buffering mechanisms are not mutually exclusive. When you enable filesystem buffering for your input plugin (source), you get the best of both worlds: performance and data safety.
When Filesystem buffering is enabled, the behavior of the engine is different. Upon Chunk creation, the engine stores the content in memory and also maps a copy on disk (through mmap(2)). The newly created Chunk is (1) active in memory, (2) backed up on disk, and (3) said to be up, which means "the chunk content is up in memory".
How does the Filesystem buffering mechanism deal with high memory usage and backpressure? Calyptia Core Agent controls the number of Chunks that are up in memory.
By default, the engine allows 128 Chunks to be up in memory in total (considering all Chunks); this value is controlled by the service property storage.max_chunks_up. The Chunks that are up are the ones ready for delivery and the ones still receiving records. Any other Chunk is in a down state, which means that it exists only in the filesystem and won't be brought up in memory unless it is ready to be delivered. Remember, chunks are never much larger than 2 MB; thus, with the default storage.max_chunks_up value of 128, each input is limited to roughly 256 MB of memory (128 × 2 MB).
If the input plugin has storage.type set to filesystem, then on reaching the storage.max_chunks_up threshold, instead of the plugin being paused, all new data goes to Chunks that are down in the filesystem. This lets us control the memory usage of the service and also provides a guarantee that the service won't lose any data. By default, the enforcement of the storage.max_chunks_up limit is best-effort. Calyptia Core Agent can only append new data to chunks that are up; when the limit is reached, chunks are temporarily brought up in memory to ingest new data and then put back into a down state. In general, Calyptia Core Agent will work to keep the total number of up chunks at or below storage.max_chunks_up.
If storage.pause_on_chunks_overlimit is enabled (default is off), the input plugin will be paused upon exceeding storage.max_chunks_up. Thus, with this option, storage.max_chunks_up becomes a hard limit for the input. When the input is paused, records will not be ingested until it is resumed. For some inputs, such as TCP and tail, pausing the input will almost certainly lead to log loss. For the tail input, Calyptia Core Agent can save its current offset in the current file it is reading, and pick back up when the input is resumed.
Look for messages in the Calyptia Core Agent log output like:
[input] tail.1 paused (storage buf overlimit)
[input] tail.1 resume (storage buf overlimit)
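A minimal sketch of a configuration that turns storage.max_chunks_up into a hard limit for one input. The log path and the value of 64 (roughly 128 MB at about 2 MB per Chunk) are illustrative assumptions, not recommendations.

```text
[SERVICE]
    flush                 1
    storage.path          /var/log/flb-storage/
    # Allow at most 64 Chunks up in memory (illustrative value)
    storage.max_chunks_up 64

[INPUT]
    name                              tail
    # Hypothetical log location
    path                              /var/log/app/*.log
    storage.type                      filesystem
    # Pause this input once storage.max_chunks_up is exceeded
    storage.pause_on_chunks_overlimit on
```

With storage.pause_on_chunks_overlimit left at its default of off, the same input would keep ingesting into down Chunks, and the limit would be enforced on a best-effort basis.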

Limiting Filesystem space for Chunks

Calyptia Core Agent implements the concept of logical queues: based on its Tag, a Chunk can be routed to multiple destinations. Thus, we keep an internal reference from where a Chunk was created and where it needs to go.
When a Chunk has multiple destinations, it's common for one destination to be slower than the others, or for only one of them to generate backpressure. In this scenario, how do we limit the amount of filesystem Chunks that we are logically queueing?
Starting from Calyptia Core Agent v1.6, output plugins support the configuration property storage.total_limit_size, which limits the total size of the Chunks that exist in the filesystem for a certain logical output destination. If one of the destinations reaches its storage.total_limit_size, the oldest Chunk in the queue for that logical output destination is discarded.
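To illustrate per-destination limits, the sketch below routes the same Chunks to two outputs and gives each logical queue its own storage.total_limit_size. The choice of output plugins, the host name, the port, the tag, and the limit values are assumptions made for illustration only.

```text
[SERVICE]
    storage.path /var/log/flb-storage/

[INPUT]
    name         cpu
    tag          metrics.cpu
    storage.type filesystem

# Destination 1: hypothetical HTTP endpoint that may be slow
[OUTPUT]
    name                     http
    match                    metrics.*
    host                     collector.example.com
    port                     8080
    storage.total_limit_size 50M

# Destination 2: separate logical queue with a larger limit
[OUTPUT]
    name                     stackdriver
    match                    metrics.*
    storage.total_limit_size 200M
```

If the HTTP destination falls behind and its queue reaches 50M, the oldest Chunks in that queue are discarded, while the queue for the second destination is limited independently.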

Configuration

The storage layer configuration takes place in three areas:
  • Service Section
  • Input Section
  • Output Section
The Service section configures a global environment for the storage layer, the Input sections define which buffering mechanism to use, and the Output sections define the limits for the logical filesystem queues.

Service section configuration

The Service section refers to the section defined in the main configuration file:
| Key | Description | Default |
| --- | --- | --- |
| storage.path | Set an optional location in the file system to store streams and chunks of data. If this parameter is not set, Input plugins can only use in-memory buffering. | |
| storage.sync | Configure the synchronization mode used to store the data in the file system. It can take the values normal or full. Using full increases the reliability of the filesystem buffer and ensures that data is synced to the filesystem even if Calyptia Core Agent crashes. On Linux, full corresponds with the MAP_SYNC option for memory mapped files. | normal |
| storage.checksum | Enable the data integrity check when writing and reading data from the filesystem. The storage layer uses the CRC32 algorithm. | off |
| storage.max_chunks_up | If the input plugin has enabled the filesystem storage type, this property sets the maximum number of Chunks that can be up in memory. This is the setting to use to control memory usage when you enable storage.type filesystem. | 128 |
| storage.backlog.mem_limit | If storage.path is set, Calyptia Core Agent will look for data chunks that were not delivered and are still in the storage layer; these are called backlog data. Backlog chunks are filesystem chunks left over from a previous Calyptia Core Agent run: chunks that could not be sent before exit and that Calyptia Core Agent picks up when restarted. Calyptia Core Agent checks the storage.backlog.mem_limit value against the current memory usage of all up chunks for the input. If the up chunks currently consume less memory than the limit, it brings the backlog chunks up into memory so they can be sent by outputs. | 5M |
| storage.metrics | If the http_server option has been enabled in the main [SERVICE] section, this option registers a new endpoint where internal metrics of the storage layer can be consumed. For more details refer to the Monitoring section. | off |
| storage.delete_irrecoverable_chunks | When enabled, irrecoverable chunks are deleted during runtime, and any other irrecoverable chunk located in the configured storage path directory is deleted when Calyptia Core Agent starts. | off |
A Service section will look like this:
[SERVICE]
    flush                     1
    log_Level                 info
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.checksum          off
    storage.backlog.mem_limit 5M
This configuration enables optional filesystem buffering with data stored under /var/log/flb-storage/. It uses the normal synchronization mode, does not run a checksum, and uses up to 5 MB of memory when processing backlog data.
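The Service keys storage.metrics and storage.delete_irrecoverable_chunks can be enabled in the same section. As a hedged sketch, the section below turns both on, together with full synchronization and checksums. The http_listen and http_port values are assumptions (storage.metrics only requires that http_server be enabled), and the metrics endpoint itself is described in the Monitoring section rather than here.

```text
[SERVICE]
    flush                               1
    log_level                           info
    # Built-in HTTP server, required for storage.metrics
    http_server                         on
    # Assumed listen address and port
    http_listen                         0.0.0.0
    http_port                           2020
    storage.path                        /var/log/flb-storage/
    # Favor reliability of the filesystem buffer over speed
    storage.sync                        full
    storage.checksum                    on
    storage.metrics                     on
    # Delete chunks with a bad layout or an invalid header size
    storage.delete_irrecoverable_chunks on
```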

Input section configuration

Optionally, any Input plugin can configure its storage preference. The following table describes the available options:
| Key | Description | Default |
| --- | --- | --- |
| storage.type | Specifies the buffering mechanism to use. It can be memory or filesystem. | memory |
| storage.pause_on_chunks_overlimit | Specifies if the input plugin should be paused (stop ingesting new data) when the storage.max_chunks_up value is reached. | off |
The following example configures a service that offers filesystem buffering capabilities and two Input plugins: the first uses filesystem buffering and the second uses memory only.
[SERVICE]
    flush                     1
    log_Level                 info
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.checksum          off
    storage.max_chunks_up     128
    storage.backlog.mem_limit 5M

[INPUT]
    name         cpu
    storage.type filesystem

[INPUT]
    name         mem
    storage.type memory

Output section configuration

If certain chunks are filesystem based (storage.type filesystem), it's possible to control the size of the logical queue for an output plugin. The following table describes the available options:
| Key | Description | Default |
| --- | --- | --- |
| storage.total_limit_size | Limit the maximum size of Chunks in the filesystem for the current output logical destination. | |
The following example creates records with CPU usage samples in the filesystem and then delivers them to the Google Stackdriver service, limiting the logical queue (buffering) to 5M:
[SERVICE]
    flush                     1
    log_Level                 info
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.checksum          off
    storage.max_chunks_up     128
    storage.backlog.mem_limit 5M

[INPUT]
    name         cpu
    storage.type filesystem

[OUTPUT]
    name                     stackdriver
    match                    *
    storage.total_limit_size 5M
If Calyptia Core Agent goes offline for some reason, for example because of a network issue, it will continue buffering CPU samples but keep at most 5M of the newest data.