Monitoring
Learn how to monitor your Calyptia Core Agent data pipelines
Calyptia Core Agent comes with built-in features that let you monitor the internals of your pipeline, integrate with Prometheus and Grafana, run health checks, and connect to external services for these purposes:
HTTP Server
Calyptia Core Agent comes with a built-in HTTP Server that can be used to query internal information and monitor metrics of each running plugin.
The monitoring interface can be easily integrated with Prometheus, since the agent supports Prometheus' native text format.
Getting Started
To get started, the first step is to enable the HTTP Server from the configuration file:
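```
# Minimal example in the classic .conf format; the cpu input and stdout
# output below are placeholders for your own pipeline.
[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

[INPUT]
    Name cpu

[OUTPUT]
    Name  stdout
    Match *
```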
The above configuration snippet instructs Calyptia Core Agent to start its HTTP server on TCP port 2020, listening on all network interfaces.
Now a simple curl command is enough to gather some information:
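```
# Assumes the agent is listening on the default 0.0.0.0:2020
curl -s http://127.0.0.1:2020 | jq
```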
Note that we are sending the curl command output to the jq program, which helps make the JSON data easy to read in the terminal. Calyptia Core Agent itself doesn't aim to do JSON pretty-printing.
REST API Interface
Calyptia Core Agent aims to expose useful interfaces for monitoring. As of Calyptia Core Agent v22.10.03, the following endpoints are available:
URI | Description | Data Format |
---|---|---|
/ | Calyptia Core Agent build information | JSON |
/api/v1/uptime | Get uptime information in seconds and human-readable format | JSON |
/api/v1/metrics | Internal metrics per loaded plugin | JSON |
/api/v1/metrics/prometheus | Internal metrics per loaded plugin ready to be consumed by a Prometheus Server | Prometheus Text 0.0.4 |
/api/v1/storage | Get internal metrics of the storage layer / buffered data. This option is enabled only if the storage.metrics property has been enabled in the SERVICE section. | JSON |
/api/v1/health | Calyptia Core Agent health check result | String |
Metric descriptions
The following are detailed descriptions for the metrics output in Prometheus format by /api/v1/metrics/prometheus.
The following definitions are key to understanding the metrics:
record: a single message collected from a source, such as a single long line in a file.
chunk: Calyptia Core Agent input plugin instances ingest log records and store them in chunks. A batch of records in a chunk is tracked together as a single unit; the Calyptia Core Agent engine attempts to fit records into chunks of at most 2 MB, but the size can vary at runtime. Chunks are then sent to an output. An output plugin instance can either successfully send the full chunk to the destination and mark it as successful, fail the chunk entirely if an unrecoverable error is encountered, or ask for the chunk to be retried.
Metric Name | Labels | Description | Type | Unit |
---|---|---|---|---|
fluentbit_input_bytes_total | name: the name or alias for the input instance | The number of bytes of log records that this input instance has successfully ingested | counter | bytes |
fluentbit_input_records_total | name: the name or alias for the input instance | The number of log records this input has successfully ingested | counter | records |
fluentbit_output_dropped_records_total | name: the name or alias for the output instance | The number of log records that have been dropped by the output. This means they met an unrecoverable error or retries expired for their chunk. | counter | records |
fluentbit_output_errors_total | name: the name or alias for the output instance | The number of chunks that have faced an error (either unrecoverable or retriable). This is the number of times a chunk has failed, and does not correspond with the number of error messages you see in the Calyptia Core Agent log output. | counter | chunks |
fluentbit_output_proc_bytes_total | name: the name or alias for the output instance | The number of bytes of log records that this output instance has successfully sent. This is the total byte size of all unique chunks sent by this output. If a record is not sent due to some error, then it will not count towards this metric. | counter | bytes |
fluentbit_output_proc_records_total | name: the name or alias for the output instance | The number of log records that this output instance has successfully sent. This is the total record count of all unique chunks sent by this output. If a record is not successfully sent, it does not count towards this metric. | counter | records |
fluentbit_output_retried_records_total | name: the name or alias for the output instance | The number of log records that experienced a retry. Note that this is calculated at the chunk level; the count is increased when an entire chunk is marked for retry. An output plugin may or may not perform multiple actions that generate many error messages when uploading a single chunk. | counter | records |
fluentbit_output_retries_failed_total | name: the name or alias for the output instance | The number of times that retries expired for a chunk. Each plugin configures a Retry_Limit which applies to chunks. Once the Retry_Limit has been reached for a chunk it is discarded and this metric is incremented. | counter | chunks |
fluentbit_output_retries_total | name: the name or alias for the output instance | The number of times this output instance requested a retry for a chunk. | counter | chunks |
fluentbit_uptime | | The number of seconds that Calyptia Core Agent has been running. | counter | seconds |
process_start_time_seconds | | The Unix Epoch time stamp for when Calyptia Core Agent started. | gauge | seconds |
The following are detailed descriptions for the metrics output in JSON format by /api/v1/storage.
Metric Key | Description | Unit |
---|---|---|
chunks.total_chunks | The total number of chunks of records that Calyptia Core Agent is currently buffering | chunks |
chunks.mem_chunks | The total number of chunks that are buffered in memory at this time. Note that chunks can be both in memory and on the file system at the same time. | chunks |
chunks.fs_chunks | The total number of chunks saved to the filesystem. | chunks |
chunks.fs_chunks_up | A chunk is "up" if it is in memory. So this is the count of chunks that are both in filesystem and in memory. | chunks |
chunks.fs_chunks_down | The count of chunks that are "down" and thus are only in the filesystem. | chunks |
input_chunks.{plugin name}.status.overlimit | Is this input instance over its configured Mem_Buf_Limit? | boolean |
input_chunks.{plugin name}.status.mem_size | The size of memory that this input is consuming to buffer logs in chunks. | bytes |
input_chunks.{plugin name}.status.mem_limit | The buffer memory limit (Mem_Buf_Limit) that applies to this input plugin. | bytes |
input_chunks.{plugin name}.chunks.total | The current total number of chunks owned by this input instance. | chunks |
input_chunks.{plugin name}.chunks.up | The current number of chunks that are "up" in memory for this input. Chunks that are "up" will also be in the filesystem layer as well if filesystem storage is enabled. | chunks |
input_chunks.{plugin name}.chunks.down | The current number of chunks that are "down" in the filesystem for this input. | chunks |
input_chunks.{plugin name}.chunks.busy | "Busy" chunks are chunks that are being processed/sent by outputs and are not eligible to have new data appended. | chunks |
input_chunks.{plugin name}.chunks.busy_size | The sum of the byte size of each chunk which is currently marked as busy. | bytes |
Uptime example
Query the service uptime with the following command:
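```
# Assumes the agent is listening on the default 0.0.0.0:2020
curl -s http://127.0.0.1:2020/api/v1/uptime | jq
```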
It should print output similar to this (values and wording are illustrative):
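```
{
  "uptime_sec": 8950000,
  "uptime_hr": "Calyptia Core Agent has been running: 103 days, 14 hours, 6 minutes and 40 seconds"
}
```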
Metrics examples
Query internal metrics in JSON format with the following command:
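```
# Assumes the agent is listening on the default 0.0.0.0:2020
curl -s http://127.0.0.1:2020/api/v1/metrics | jq
```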
It should print output similar to this (values are illustrative, assuming the cpu/stdout pipeline from the earlier snippet):
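```
{
  "input": {
    "cpu.0": {
      "records": 8,
      "bytes": 2536
    }
  },
  "output": {
    "stdout.0": {
      "proc_records": 5,
      "proc_bytes": 1585,
      "errors": 0,
      "retries": 0,
      "retries_failed": 0
    }
  }
}
```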
Metrics in Prometheus format
Query internal metrics in Prometheus Text 0.0.4 format:
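```
curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus
```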
This time the same metrics will be in Prometheus format instead of JSON (output abbreviated and illustrative):
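```
fluentbit_input_records_total{name="cpu.0"} 8
fluentbit_input_bytes_total{name="cpu.0"} 2536
fluentbit_output_proc_records_total{name="stdout.0"} 5
fluentbit_output_proc_bytes_total{name="stdout.0"} 1585
fluentbit_output_errors_total{name="stdout.0"} 0
fluentbit_output_retries_total{name="stdout.0"} 0
fluentbit_output_retries_failed_total{name="stdout.0"} 0
```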
Configuring aliases
By default, configured plugins get an internal name at runtime in the format plugin_name.ID (for example, cpu.0). For monitoring purposes, this can be confusing if many plugins of the same type are configured. To distinguish them, each configured input or output section can be given an alias that is used as the parent name for the metric.
The following example sets an alias for an INPUT section that uses the CPU input plugin:
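```
# The alias names server1_cpu and raw_output are illustrative.
[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

[INPUT]
    Name   cpu
    Alias  server1_cpu

[OUTPUT]
    Name   stdout
    Alias  raw_output
    Match  *
```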
Now when querying the metrics, we get the aliases in place of the plugin names (illustrative output):
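```
{
  "input": {
    "server1_cpu": {
      "records": 8,
      "bytes": 2536
    }
  },
  "output": {
    "raw_output": {
      "proc_records": 5,
      "proc_bytes": 1585,
      "errors": 0,
      "retries": 0,
      "retries_failed": 0
    }
  }
}
```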
Grafana Dashboard and Alerts
The exposed Prometheus-style metrics for Calyptia Core Agent can be leveraged to create dashboards and alerts.
The provided example dashboard is heavily inspired by the Banzai Cloud logging operator dashboard, but with a few key differences, such as the use of the instance label, stacked graphs, and a focus on Calyptia Core Agent metrics.
Alerts
Sample alerts are available here.
Health Check for Calyptia Core Agent
Calyptia Core Agent supports four configuration properties to set up the health check:
Config Name | Description | Default Value |
---|---|---|
Health_Check | Enable the health check feature | Off |
HC_Errors_Count | The number of output errors required to mark the agent unhealthy; errors are summed across all output plugins within the configured HC_Period | 5 |
HC_Retry_Failure_Count | The number of retry failures required to mark the agent unhealthy; retry failures are summed across all output plugins within the configured HC_Period | 5 |
HC_Period | The time period, in seconds, over which errors and retry failures are counted | 60 |
Note: not every error message in the log is counted; only output chunk errors and expired retry failures count toward the health check.
The feature works as follows: within the configured HC_Period, if the observed error count exceeds HC_Errors_Count or the retry failure count exceeds HC_Retry_Failure_Count, Calyptia Core Agent is considered unhealthy. In that case the health endpoint returns HTTP status 500 with the string error; otherwise the agent is healthy and the endpoint returns HTTP status 200 with the string ok.
The equation is (a pseudo-logic sketch of the behavior described above):
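```
unhealthy = (error count within HC_Period > HC_Errors_Count)
         OR (retry failure count within HC_Period > HC_Retry_Failure_Count)
```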
Note: HC_Errors_Count and HC_Retry_Failure_Count apply only to output plugins, and each is a sum over errors and retry failures from all running output plugin instances.
See the following configuration example:
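```
# Example health check configuration; the HC_Period of 5 seconds matches
# the rule described below. The cpu input and stdout output are placeholders.
[SERVICE]
    HTTP_Server             On
    HTTP_Listen             0.0.0.0
    HTTP_Port               2020
    Health_Check            On
    HC_Errors_Count         5
    HC_Retry_Failure_Count  5
    HC_Period               5

[INPUT]
    Name cpu

[OUTPUT]
    Name  stdout
    Match *
```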
The command to call the health endpoint:
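```
curl -s http://127.0.0.1:2020/api/v1/health
```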
Based on the Calyptia Core Agent status, the result will be:
HTTP status 200 and the string "ok" in the response for a healthy status
HTTP status 500 and the string "error" in the response for an unhealthy status
With the example configuration, the health status is determined by the following rule:
If the error count exceeds 5 or the retry failure count exceeds 5 within a 5-second period, the agent is unhealthy.
Otherwise, it is healthy.
Calyptia Cloud
Calyptia Cloud is a hosted service that allows you to monitor your Calyptia Core Agent instances, including data flow, metrics and configurations.
Get Started with Calyptia Cloud
Registering your Calyptia Core Agent instances takes less than one minute. Steps:
Go to cloud.calyptia.com and sign in
In the left menu, click Settings and generate/copy your API key
In your Calyptia Core Agent configuration file, append the following configuration section:
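```
# <YOUR_API_KEY> is a placeholder for the API key copied from the dashboard
[CUSTOM]
    Name     calyptia
    api_key  <YOUR_API_KEY>
```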
Make sure to replace the placeholder with your API key in the configuration.
A few seconds after you restart Calyptia Core Agent, the Calyptia Cloud dashboard will list your agent. Metrics take around 30 seconds to show up.
Contact Calyptia
To get in touch with the Calyptia team, send an email to hello@calyptia.com