Metrics exported via Prometheus
Hasura exports three types of Prometheus metrics:

- Histogram: Represents the distribution of a set of values across a set of buckets. Note that histogram buckets are cumulative. You can read more about the histogram metric type [here](https://prometheus.io/docs/concepts/metric_types/#histogram). For example, `hasura_event_webhook_processing_time_seconds` is a histogram metric.
- Counter: Represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. You can read more about the counter metric type [here](https://prometheus.io/docs/concepts/metric_types/#counter). For example, `hasura_graphql_requests_total` is a counter metric.
- Gauge: Represents a single numerical value that can arbitrarily go up and down. You can read more about the gauge metric type [here](https://prometheus.io/docs/concepts/metric_types/#gauge). For example, `hasura_active_subscriptions` is a gauge metric.
Metrics exported
The following metrics are exported by Hasura GraphQL Engine:
GraphQL request metrics
Hasura GraphQL execution time seconds
Execution time of successful GraphQL requests (excluding subscriptions). If a growing share of requests falls into the higher buckets, you should consider tuning performance.
Name | hasura_graphql_execution_time_seconds |
Type | Histogram (buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10) |
Labels | operation_type: query | mutation |
Unit | seconds |
- Uses wall-clock time, so it includes time spent waiting on I/O.
- Includes authorization, parsing, validation, planning, and execution (calls to databases, Remote Schemas).
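As a usage sketch, a PromQL query along the following lines reports the 95th-percentile execution time per operation type (the 5-minute rate window is an assumption; tune it to your scrape interval):

```
# p95 GraphQL execution time, split by query vs. mutation
histogram_quantile(
  0.95,
  sum by (le, operation_type) (
    rate(hasura_graphql_execution_time_seconds_bucket[5m])
  )
)
```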
Hasura GraphQL requests total
Number of GraphQL requests received, representing the GraphQL query/mutation traffic on the server.
Name | hasura_graphql_requests_total |
Type | Counter |
Labels | operation_type: query | mutation | subscription | unknown, response_status: success | failed, operation_name, parameterized_query_hash |
The `unknown` operation type will be returned for queries that fail authorization, parsing, or certain validations. The `response_status` label will be `success` for successful requests and `failed` for failed requests.
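For instance, a sketch of an error-rate query, dividing failed requests by all requests over the last five minutes (window size is an assumption):

```
# Fraction of GraphQL requests that failed in the last 5 minutes
sum(rate(hasura_graphql_requests_total{response_status="failed"}[5m]))
  /
sum(rate(hasura_graphql_requests_total[5m]))
```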
Hasura Event Triggers metrics
See more details on Event trigger observability here.
Event fetch time per batch
Hasura fetches events in batches (100 by default) from the Hasura Event tables in the database. This metric represents the time taken to fetch a batch of events from the database.
A higher value indicates slower polling of events from the database; you should consider looking into the performance of your database.
Name | hasura_event_fetch_time_per_batch_seconds |
Type | Histogram (buckets: 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10) |
Labels | none |
Unit | seconds |
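One way to watch this (a sketch, not the only approach) is the average fetch time per batch, computed from the histogram's `_sum` and `_count` series:

```
# Average time to fetch one batch of events, over the last 5 minutes
rate(hasura_event_fetch_time_per_batch_seconds_sum[5m])
  /
rate(hasura_event_fetch_time_per_batch_seconds_count[5m])
```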
Event invocations total
This metric represents the number of HTTP requests that have been made to the webhook server for delivering events.
Name | hasura_event_invocations_total |
Type | Counter |
Labels | status: success | failed, trigger_name, source_name |
Event processed total
Total number of events processed. Represents the Event Trigger egress.
Name | hasura_event_processed_total |
Type | Counter |
Labels | status: success | failed, trigger_name, source_name |
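For example, a sketch of a per-trigger failure ratio over the last five minutes (window size is an assumption):

```
# Fraction of processed events that failed, per trigger and source
sum by (trigger_name, source_name) (
  rate(hasura_event_processed_total{status="failed"}[5m])
)
  /
sum by (trigger_name, source_name) (
  rate(hasura_event_processed_total[5m])
)
```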
Event processing time
Time taken for an event to be processed.
Name | hasura_event_processing_time_seconds |
Type | Histogram (buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100) |
Labels | trigger_name, source_name |
Unit | seconds |
The processing of an event involves the following steps:
- Hasura Engine fetching the event from Hasura Event tables in the database and adding it to the Hasura Events queue
- An HTTP worker picking up the event from the Hasura Events queue
- An HTTP worker delivering the event to the webhook
Note that if the delivery of an event fails, delivery is retried based on its `next_retry_at` configuration.
This metric represents the time taken for an event to be delivered since it was created (for the first attempt) or since it was scheduled for retry (for subsequent attempts). It can be considered the end-to-end processing time for an event.
For example, say an event was created at `2021-01-01 10:00:30` with a retry configuration which says that if delivery fails, the event should be retried after 30 seconds.
At `2021-01-01 10:01:30`, the event was fetched from the Hasura Event tables, picked up by the HTTP worker, and delivery was attempted. The delivery failed, and a `next_retry_at` of `2021-01-01 10:02:00` was set for the event.
At `2021-01-01 10:02:00`, the event was fetched again from the Hasura Event tables and picked up by the HTTP worker, and delivery was attempted at `2021-01-01 10:03:30`. This time, the delivery was successful.
The processing time for the second delivery attempt would be:
Processing time = event delivery time - event retry time = `2021-01-01 10:03:30` - `2021-01-01 10:02:00` = 90 seconds
Event queue time
Hasura fetches events from the Hasura Event tables in the database and adds them to the Hasura Events queue. The event queue time represents the time taken for an event to be picked up by the HTTP worker after it has been added to the events queue.
A higher value of this metric implies slow event processing. In this case, you can consider increasing the HTTP pool size or optimizing the webhook server.
Name | hasura_event_queue_time_seconds |
Type | Histogram (buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100) |
Labels | trigger_name, source_name |
Unit | seconds |
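As an illustration, a p95 queue-time sketch per trigger (the 5-minute window is an assumption):

```
# p95 time events spend waiting in the queue, per trigger
histogram_quantile(
  0.95,
  sum by (le, trigger_name) (
    rate(hasura_event_queue_time_seconds_bucket[5m])
  )
)
```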
Event Triggers HTTP Workers
Current number of active Event Trigger HTTP workers. Compare this number to the HTTP pool size, and consider increasing the pool size if this metric stays close to the configured value.
Name | hasura_event_trigger_http_workers |
Type | Gauge |
Labels | none |
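A possible alert expression, assuming the events HTTP pool size is at its default of 100 (configured via `HASURA_GRAPHQL_EVENTS_HTTP_POOL_SIZE`; substitute your own value):

```
# Fires when active HTTP workers reach 90% of the assumed pool size of 100
hasura_event_trigger_http_workers >= 0.9 * 100
```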
Event webhook processing time
The time from when an HTTP worker picks up an event for delivery to when it sends the event payload to the webhook.
A higher processing time indicates a slow webhook; you should try to optimize the event webhook.
Name | hasura_event_webhook_processing_time_seconds |
Type | Histogram (buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10) |
Labels | trigger_name, source_name |
Unit | seconds |
Events fetched per batch
Number of events fetched from the Hasura Event tables in the database per batch. This number should be less than or equal to the events fetch batch size.
Name | hasura_events_fetched_per_batch |
Type | Gauge |
Labels | none |
Since continuously polling the database to check for pending events is an expensive operation, Hasura polls the database only when there are pending events. This metric can be used to understand whether there are pending events in the Hasura Event tables.
Note that Hasura fetches events from the Hasura Event tables only if there are pending events. If there are none, this metric will be 0.
Subscription metrics
See more details on subscriptions observability here.
Active Subscriptions
Current number of active subscriptions, representing the subscription load on the server.
Name | hasura_active_subscriptions |
Type | Gauge |
Labels | subscription_kind: streaming | live-query, operation_name, parameterized_query_hash |
Active Subscription Pollers
Current number of active subscription pollers. A subscription poller multiplexes similar subscriptions together. The value of this metric should be proportional to the number of uniquely parameterized subscriptions, i.e., subscriptions with the same selection set but different input arguments and session variables are multiplexed on the same poller. If this metric is high, it may indicate that there are too many uniquely parameterized subscriptions, which could be optimized for better performance.
Name | hasura_active_subscription_pollers |
Type | Gauge |
Labels | subscription_kind: streaming | live-query |
Active Subscription Pollers in Error State
Current number of active subscription pollers that are in an error state. A subscription poller multiplexes similar subscriptions together. A non-zero value of this metric indicates that there are runtime errors in at least one of the subscription pollers running in Hasura. In most cases, runtime errors in subscriptions are caused by changes at the data model layer; fixing the issue at the data model layer should automatically resolve the runtime errors.
Name | hasura_active_subscription_pollers_in_error_state |
Type | Gauge |
Labels | subscription_kind: streaming | live-query |
Subscription Total Time
The time taken to complete one run of a subscription poller.
A subscription poller multiplexes similar subscriptions together. By default, the poller runs every second and queries the database with the multiplexed query to fetch the latest data. In a single poll, the poller not only queries the database but also performs other operations, such as splitting similar queries into batches (100 by default) before fetching their data from the database. This metric is the total time taken to complete all the operations in a single poll.
In a single poll, the subscription poller splits the different variables for the multiplexed query into batches (100 by default) and executes the batches. We use the `hasura_subscription_db_execution_time_seconds` metric to observe the time taken for each batch to execute on the database. This means that for a single poll there can be multiple values for the `hasura_subscription_db_execution_time_seconds` metric.
Let's look at an example to understand these metrics better:
Say we have 650 subscriptions with the same selection set but different input arguments. These 650 subscriptions will be grouped into one multiplexed query, and a single poller will be created to run it. This poller will run every second.
The default batch size in Hasura is 100, so the 650 subscriptions will be split into 7 batches for execution during a single poll.
During a single poll:

- Batch 1: `hasura_subscription_db_execution_time_seconds` = 0.002 seconds
- Batch 2: `hasura_subscription_db_execution_time_seconds` = 0.001 seconds
- Batch 3: `hasura_subscription_db_execution_time_seconds` = 0.003 seconds
- Batch 4: `hasura_subscription_db_execution_time_seconds` = 0.001 seconds
- Batch 5: `hasura_subscription_db_execution_time_seconds` = 0.002 seconds
- Batch 6: `hasura_subscription_db_execution_time_seconds` = 0.001 seconds
- Batch 7: `hasura_subscription_db_execution_time_seconds` = 0.002 seconds
The `hasura_subscription_total_time_seconds` value would be the sum of the database execution times shown above (0.012 seconds), plus some extra processing time for the other tasks the poller performs during a single poll. In this case, it would be approximately 0.013 seconds.
Name | hasura_subscription_total_time_seconds |
Type | Histogram (buckets: 0.000001, 0.0001, 0.01, 0.1, 0.3, 1, 3, 10, 30, 100) |
Labels | subscription_kind: streaming | live-query, operation_name, parameterized_query_hash |
Unit | seconds |
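For example, a p95 sketch of the per-poll total time for live queries (the label value and the window are assumptions; adjust to what you run):

```
# p95 total poll time for live-query pollers
histogram_quantile(
  0.95,
  sum by (le) (
    rate(hasura_subscription_total_time_seconds_bucket{subscription_kind="live-query"}[5m])
  )
)
```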
Subscription Database Execution Time
The time taken to run the subscription's multiplexed query in the database for a single batch.
A subscription poller multiplexes similar subscriptions together. During every run (every second by default), the poller splits the different variables for the multiplexed query into batches (100 by default) and executes the batches. This metric observes the time taken for each batch to execute on the database.
If this metric is high, it may be an indication that the database is not performing as expected; you should consider investigating the subscription query and checking whether indexes can improve performance.
Name | hasura_subscription_db_execution_time_seconds |
Type | Histogram (buckets: 0.000001, 0.0001, 0.01, 0.1, 0.3, 1, 3, 10, 30, 100) |
Labels | subscription_kind: streaming | live-query, operation_name, parameterized_query_hash |
Unit | seconds |
WebSocket Egress
The total size of WebSocket messages sent in bytes.
Name | hasura_websocket_messages_sent_bytes_total |
Type | Counter |
Labels | operation_name, parameterized_query_hash |
Unit | bytes |
WebSocket Ingress
The total size of WebSocket messages received in bytes.
Name | hasura_websocket_messages_received_bytes_total |
Type | Counter |
Labels | none |
Unit | bytes |
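Both counters lend themselves to simple bandwidth queries; a sketch (5-minute window assumed):

```
# Outbound WebSocket bandwidth (bytes/sec), per operation
sum by (operation_name) (rate(hasura_websocket_messages_sent_bytes_total[5m]))

# Inbound WebSocket bandwidth (bytes/sec)
rate(hasura_websocket_messages_received_bytes_total[5m])
```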
WebSocket Message Queue Time
The time for which a WebSocket message remains queued in the GraphQL Engine's WebSocket queue.
Name | hasura_websocket_message_queue_time |
Type | Histogram (buckets: 0.000001, 0.0001, 0.01, 0.1, 0.3, 1, 3, 10, 30, 100) |
Labels | none |
Unit | seconds |
WebSocket Message Write Time
The time taken to write a WebSocket message into the TCP send buffer.
Name | hasura_websocket_message_write_time |
Type | Histogram (buckets: 0.000001, 0.0001, 0.01, 0.1, 0.3, 1, 3, 10, 30, 100) |
Labels | none |
Unit | seconds |
Cache metrics
See more details on caching metrics here.
Hasura cache request count
Total number of cache hit and miss requests. This helps in monitoring and optimizing cache utilization.
Name | hasura_cache_request_count |
Type | Counter |
Labels | status: hit | miss |
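For example, a cache hit-ratio sketch over the last five minutes (window size is an assumption):

```
# Cache hit ratio: hits divided by all cache requests
sum(rate(hasura_cache_request_count{status="hit"}[5m]))
  /
sum(rate(hasura_cache_request_count[5m]))
```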
Cron trigger metrics
Hasura cron events invocation total
Total number of cron events invoked, representing the number of invocations made for cron events.
Name | hasura_cron_events_invocation_total |
Type | Counter |
Labels | status: success | failed |
Hasura cron events processed total
Total number of cron events processed. Compare this to `hasura_cron_events_invocation_total`; a large difference between the two metrics indicates a high failure rate of the cron webhook (a query sketch follows the table below).
Name | hasura_cron_events_processed_total |
Type | Counter |
Labels | status: success | failed |
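One way to make that comparison concrete (a sketch; the 1-hour window is an assumption):

```
# Invocations that did not result in a processed cron event over the last hour
sum(increase(hasura_cron_events_invocation_total[1h]))
  -
sum(increase(hasura_cron_events_processed_total[1h]))
```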
One-off Scheduled events metrics
Hasura one-off events invocation total
Total number of one-off events invoked, representing the number of invocations made for one-off events.
Name | hasura_oneoff_events_invocation_total |
Type | Counter |
Labels | status: success | failed |
Hasura one-off events processed total
Total number of one-off scheduled events processed. Compare this to `hasura_oneoff_events_invocation_total`; a large difference between the two metrics indicates a high failure rate of the one-off webhook.
Name | hasura_oneoff_events_processed_total |
Type | Counter |
Labels | status: success | failed |
Hasura HTTP connections
Current number of active HTTP connections (excluding WebSocket connections), representing the HTTP load on the server.
Name | hasura_http_connections |
Type | Gauge |
Labels | none |
Hasura WebSocket connections
Current number of active WebSocket connections, representing the WebSocket load on the server.
Name | hasura_websocket_connections |
Type | Gauge |
Labels | none |
Hasura Postgres connections
Current number of active PostgreSQL connections. Compare this to pool settings.
Name | hasura_postgres_connections |
Type | Gauge |
Labels | source_name: name of the database, conn_info: connection URL string (password omitted) or name of the connection URL environment variable, role: primary | replica |
Hasura Postgres Connection Initialization Time
The time taken to establish and initialize a PostgreSQL connection.
Name | hasura_postgres_connection_init_time |
Type | Histogram (buckets: 0.000001, 0.0001, 0.01, 0.1, 0.3, 1, 3, 10, 30, 100) |
Labels | source_name: name of the database, conn_info: connection URL string (password omitted) or name of the connection URL environment variable, role: primary | replica |
Unit | seconds |
Hasura Postgres Pool Wait Time
The time taken to acquire a connection from the pool.
Name | hasura_postgres_pool_wait_time |
Type | Histogram (buckets: 0.000001, 0.0001, 0.01, 0.1, 0.3, 1, 3, 10, 30, 100) |
Labels | source_name: name of the database, conn_info: connection URL string (password omitted) or name of the connection URL environment variable, role: primary | replica |
Unit | seconds |
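For instance, a p95 pool-wait sketch per source and role; sustained high values usually point at pool exhaustion (window size is an assumption):

```
# p95 time spent waiting for a connection from the pool
histogram_quantile(
  0.95,
  sum by (le, source_name, role) (
    rate(hasura_postgres_pool_wait_time_bucket[5m])
  )
)
```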
Hasura source health
Health check status of a particular data source, corresponding to the output of `/healthz/sources`, with possible values 0 through 3 indicating, respectively: OK, TIMEOUT, FAILED, and ERROR. See the Source Health Check API Reference for details.
Name | hasura_source_health |
Type | Gauge |
Labels | source_name: name of the database |
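Since 0 means OK, a minimal alerting expression (a sketch) fires whenever any source is degraded:

```
# Non-zero means degraded: 1 = TIMEOUT, 2 = FAILED, 3 = ERROR
hasura_source_health > 0
```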
HTTP Egress
Total size of HTTP response bodies sent via the HTTP server, excluding responses to the `/healthz` and `/v1/version` endpoints or to any undefined resource/endpoint (for example, `/foobar`).
Name | hasura_http_response_bytes_total |
Type | Counter |
Labels | none |
Unit | bytes |
HTTP Ingress
Total size of HTTP request bodies received via the HTTP server, excluding requests to the `/healthz` and `/v1/version` endpoints or to any undefined resource/endpoint (for example, `/foobar`).
Name | hasura_http_request_bytes_total |
Type | Counter |
Labels | none |
Unit | bytes |
OpenTelemetry OTLP Export Metrics
These metrics allow for monitoring the reliability and performance of OTLP exports of telemetry data.
Hasura OTLP Sent Spans
Total number of successfully exported trace spans.
Name | hasura_otel_sent_spans |
Type | Counter |
Labels | none |
Hasura OTLP Dropped Spans
Total number of trace spans dropped due to either a high trace volume that filled the buffer, or errors during sending (e.g., a timeout or an error response from the collector).
Name | hasura_otel_dropped_spans |
Type | Counter |
Labels | reason: buffer_full | send_failed |
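To monitor export reliability, a drop-rate sketch split by reason (the 5-minute window is an assumption):

```
# Spans dropped per second over the last 5 minutes, by reason
sum by (reason) (rate(hasura_otel_dropped_spans[5m]))
```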