
Metrics exported via Prometheus

Hasura exports three types of Prometheus metrics:

  • Histogram: Represents the distribution of a set of values across a set of buckets. Note that histogram buckets are cumulative. You can read more about the histogram metric type in the Prometheus documentation. For example, hasura_event_webhook_processing_time_seconds is a histogram metric.
  • Counter: A cumulative metric whose value can only monotonically increase, or be reset to zero on restart. You can read more about the counter metric type in the Prometheus documentation. For example, hasura_graphql_requests_total is a counter metric.
  • Gauge: A single numerical value that can arbitrarily go up and down. You can read more about the gauge metric type in the Prometheus documentation. For example, hasura_active_subscriptions is a gauge metric.
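
To illustrate what "cumulative" means for histogram buckets, here is a sketch of the series a histogram metric produces when scraped. The label values and counts are purely illustrative, not real output, and only a few buckets are shown; the _bucket, _sum, and _count series and the le label are the standard form in which Prometheus histograms are exposed.

  # Each bucket counts all observations less than or equal to its "le" bound,
  # so the counts can only grow as "le" increases; "+Inf" equals the total count.
  # (Only some of the buckets are shown here.)
  hasura_event_webhook_processing_time_seconds_bucket{trigger_name="t1",source_name="default",le="0.01"} 4
  hasura_event_webhook_processing_time_seconds_bucket{trigger_name="t1",source_name="default",le="0.03"} 11
  hasura_event_webhook_processing_time_seconds_bucket{trigger_name="t1",source_name="default",le="0.1"} 27
  hasura_event_webhook_processing_time_seconds_bucket{trigger_name="t1",source_name="default",le="+Inf"} 42
  # Running sum of all observed values, and the total number of observations
  hasura_event_webhook_processing_time_seconds_sum{trigger_name="t1",source_name="default"} 3.7
  hasura_event_webhook_processing_time_seconds_count{trigger_name="t1",source_name="default"} 42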

Metrics exported

The following metrics are exported by Hasura GraphQL Engine:

GraphQL request metrics

Hasura GraphQL execution time seconds

Execution time of successful GraphQL requests (excluding subscriptions). If many requests fall into the higher buckets, consider tuning performance.

Name: hasura_graphql_execution_time_seconds
Type: Histogram
Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10
Labels: operation_type: query | mutation
Unit: seconds

GraphQL request execution time:
  • Uses wall-clock time, so it includes time spent waiting on I/O.
  • Includes authorization, parsing, validation, planning, and execution (calls to databases, Remote Schemas).
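
As a sketch of how to spot requests drifting into the higher buckets, the following PromQL query approximates the 95th-percentile execution time per operation type. It assumes the metric is scraped into Prometheus and uses the standard _bucket series; the 5-minute window and the 0.95 quantile are arbitrary choices.

  # Approximate p95 GraphQL execution time per operation type over the last 5 minutes
  histogram_quantile(
    0.95,
    sum by (le, operation_type) (
      rate(hasura_graphql_execution_time_seconds_bucket[5m])
    )
  )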

Hasura GraphQL requests total

Number of GraphQL requests received, representing the GraphQL query/mutation traffic on the server.

Name: hasura_graphql_requests_total
Type: Counter
Labels: operation_type: query | mutation | subscription | unknown, response_status: success | failed, operation_name, parameterized_query_hash

The unknown operation type will be returned for queries that fail authorization, parsing, or certain validations. The response_status label will be success for successful requests and failed for failed requests.
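
For example, a query along these lines (a sketch; the 5-minute rate window is arbitrary) gives the fraction of GraphQL requests that failed:

  # Fraction of GraphQL requests that failed over the last 5 minutes
  sum(rate(hasura_graphql_requests_total{response_status="failed"}[5m]))
    /
  sum(rate(hasura_graphql_requests_total[5m]))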

Hasura Event Triggers metrics

See the Event Triggers observability documentation for more details.

Event fetch time per batch

Hasura fetches the events in batches (by default 100) from the Hasura Event tables in the database. This metric represents the time taken to fetch a batch of events from the database.

A higher value indicates slower polling of events from the database; consider looking into the performance of your database.

Name: hasura_event_fetch_time_per_batch_seconds
Type: Histogram
Buckets: 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10
Labels: none
Unit: seconds

Event invocations total

This metric represents the number of HTTP requests that have been made to the webhook server for delivering events.

Name: hasura_event_invocations_total
Type: Counter
Labels: status: success | failed, trigger_name, source_name
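
As a sketch (assuming these metrics are scraped into Prometheus; the 5-minute window is arbitrary), the share of webhook invocations that failed, per trigger, can be computed as:

  # Ratio of failed webhook invocations per Event Trigger over the last 5 minutes
  sum by (trigger_name) (rate(hasura_event_invocations_total{status="failed"}[5m]))
    /
  sum by (trigger_name) (rate(hasura_event_invocations_total[5m]))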

Event processed total

Total number of events processed. Represents the Event Trigger egress.

Name: hasura_event_processed_total
Type: Counter
Labels: status: success | failed, trigger_name, source_name

Event processing time

Time taken for an event to be processed.

Name: hasura_event_processing_time_seconds
Type: Histogram
Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100
Labels: trigger_name, source_name
Unit: seconds

The processing of an event involves the following steps:

  1. Hasura Engine fetching the event from Hasura Event tables in the database and adding it to the Hasura Events queue
  2. An HTTP worker picking up the event from the Hasura Events queue
  3. An HTTP worker delivering the event to the webhook

Event delivery failure

Note: if the delivery of an event fails, the delivery is retried based on its next_retry_at configuration.

This metric represents the time taken for an event to be delivered since it was created (if the first attempt) or retried (after the first attempt). This metric can be considered as the end-to-end processing time for an event.

For example, say an event was created at 2021-01-01 10:00:30 and its next_retry_at configuration says that if the event delivery fails, the event should be retried after 30 seconds.

At 2021-01-01 10:01:30, the event was fetched from the Hasura Event tables, picked up by the HTTP worker, and the delivery was attempted. The delivery failed, and a next_retry_at of 2021-01-01 10:02:00 was set for the event.

At 2021-01-01 10:02:00, the event was fetched again from the Hasura Event tables, picked up by the HTTP worker, and the delivery was attempted at 2021-01-01 10:03:30. This time, the delivery was successful.

The processing time for the second delivery attempt would be:

Processing time = event delivery time - event retry time

Processing time = 2021-01-01 10:03:30 - 2021-01-01 10:02:00 = 90 seconds

Event queue time

Hasura fetches the events from the Hasura Event tables in the database and adds them to the Hasura Events queue. The event queue time represents the time taken for an event to be picked up by the HTTP worker after it has been added to the "Events Queue".

A higher value of this metric implies slow event processing. In this case, consider increasing the HTTP pool size or optimizing the webhook server.

Name: hasura_event_queue_time_seconds
Type: Histogram
Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100
Labels: trigger_name, source_name
Unit: seconds
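
A sketch of a PromQL query to watch this (assuming the standard _bucket series; the 5-minute window and the 0.95 quantile are arbitrary):

  # Approximate p95 time events wait in the queue before an HTTP worker picks them up
  histogram_quantile(
    0.95,
    sum by (le) (rate(hasura_event_queue_time_seconds_bucket[5m]))
  )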

Event Triggers HTTP Workers

Current number of active Event Trigger HTTP workers. Compare this number to the HTTP pool size, and consider increasing the pool size if this metric stays close to the configured value.

Name: hasura_event_trigger_http_workers
Type: Gauge
Labels: none
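
As a sketch of an alert-style expression for this comparison (the pool size of 100 is a hypothetical value; substitute your configured HTTP pool size, and adjust the 0.9 threshold to taste):

  # Fires when active Event Trigger HTTP workers are within 90% of a pool size of 100
  hasura_event_trigger_http_workers >= 0.9 * 100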

Event webhook processing time

The time from when an HTTP worker picks up an event for delivery to when it sends the event payload to the webhook.

A higher processing time indicates a slow webhook; you should try to optimize the event webhook.

Name: hasura_event_webhook_processing_time_seconds
Type: Histogram
Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10
Labels: trigger_name, source_name
Unit: seconds

Events fetched per batch

Number of events fetched from the Hasura Event tables in the database per batch. This number should be equal to or less than the events fetch batch size.

Name: hasura_events_fetched_per_batch
Type: Gauge
Labels: none

Since continuously polling the database to check for pending events is an expensive operation, Hasura only polls the database when there are pending events. This metric can be used to understand whether there are pending events in the Hasura Event tables.

Dependent on pending events

Note that Hasura only fetches events from the Hasura Event tables if there are any pending events. If there are no pending events, this metric will be 0.

Subscription metrics

See the subscriptions observability documentation for more details.

Active Subscriptions

Current number of active subscriptions, representing the subscription load on the server.

Name: hasura_active_subscriptions
Type: Gauge
Labels: subscription_kind: streaming | live-query, operation_name, parameterized_query_hash

Active Subscription Pollers

Current number of active subscription pollers. A subscription poller multiplexes similar subscriptions together. The value of this metric should be proportional to the number of uniquely parameterized subscriptions (i.e., subscriptions with the same selection set but different input arguments and session variables are multiplexed on the same poller). If this metric is high, it may indicate that there are too many uniquely parameterized subscriptions, which could be optimized for better performance.

Name: hasura_active_subscription_pollers
Type: Gauge
Labels: subscription_kind: streaming | live-query

Active Subscription Pollers in Error State

Current number of active subscription pollers that are in the error state. A subscription poller multiplexes similar subscriptions together. A non-zero value of this metric indicates that there are runtime errors in at least one of the subscription pollers running in Hasura. In most cases, runtime errors in subscriptions are caused by changes at the data model layer; fixing the issue at the data model layer should automatically resolve the runtime errors.

Name: hasura_active_subscription_pollers_in_error_state
Type: Gauge
Labels: subscription_kind: streaming | live-query

Subscription Total Time

The time taken to complete one run of a subscription poller.

A subscription poller multiplexes similar subscriptions together. This poller runs every 1 second by default and queries the database with the multiplexed query to fetch the latest data. In a polling instance, the poller not only queries the database but also performs other operations, such as splitting similar queries into batches (by default 100) before fetching their data from the database. This metric is the total time taken to complete all the operations in a single poll.

In a single poll, the subscription poller splits the different variables for the multiplexed query into batches (by default 100) and executes the batches. We use the hasura_subscription_db_execution_time_seconds metric to observe the time taken for each batch to execute on the database. This means that for a single poll there can be multiple values for the hasura_subscription_db_execution_time_seconds metric.

Let's look at an example to understand these metrics better:

Say we have 650 subscriptions with the same selection set but different input arguments. These 650 subscriptions will be grouped to form one multiplexed query. A single poller would be created to run this multiplexed query. This poller will run every 1 second.

The default batch size in Hasura is 100, so the 650 subscriptions will be split into 7 batches for execution during a single poll.

During a single poll:

  • Batch 1: hasura_subscription_db_execution_time_seconds = 0.002 seconds
  • Batch 2: hasura_subscription_db_execution_time_seconds = 0.001 seconds
  • Batch 3: hasura_subscription_db_execution_time_seconds = 0.003 seconds
  • Batch 4: hasura_subscription_db_execution_time_seconds = 0.001 seconds
  • Batch 5: hasura_subscription_db_execution_time_seconds = 0.002 seconds
  • Batch 6: hasura_subscription_db_execution_time_seconds = 0.001 seconds
  • Batch 7: hasura_subscription_db_execution_time_seconds = 0.002 seconds

The hasura_subscription_total_time_seconds value would be the sum of all the database execution times shown above (0.012 seconds), plus some extra processing time for the other tasks the poller performs during a single poll. In this case, it would be approximately 0.013 seconds.

Name: hasura_subscription_total_time_seconds
Type: Histogram
Buckets: 0.000001, 0.0001, 0.01, 0.1, 0.3, 1, 3, 10, 30, 100
Labels: subscription_kind: streaming | live-query, operation_name, parameterized_query_hash
Unit: seconds
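
As a sketch, the average time per poller run can be derived from the standard _sum and _count series that the histogram exposes (the 5-minute window is arbitrary):

  # Average subscription poller run time per subscription kind over the last 5 minutes
  sum by (subscription_kind) (rate(hasura_subscription_total_time_seconds_sum[5m]))
    /
  sum by (subscription_kind) (rate(hasura_subscription_total_time_seconds_count[5m]))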

Subscription Database Execution Time

The time taken to run the subscription's multiplexed query in the database for a single batch.

A subscription poller multiplexes similar subscriptions together. During every run (every 1 second by default), the poller splits the different variables for the multiplexed query into batches (by default 100) and executes the batches. This metric observes the time taken for each batch to execute on the database.

If this metric is high, it may indicate that the database is not performing as expected; consider investigating the subscription query and checking whether indexes can help improve performance.

Name: hasura_subscription_db_execution_time_seconds
Type: Histogram
Buckets: 0.000001, 0.0001, 0.01, 0.1, 0.3, 1, 3, 10, 30, 100
Labels: subscription_kind: streaming | live-query, operation_name, parameterized_query_hash
Unit: seconds

WebSocket Egress

The total size of WebSocket messages sent in bytes.

Name: hasura_websocket_messages_sent_bytes_total
Type: Counter
Labels: operation_name, parameterized_query_hash
Unit: bytes

WebSocket Ingress

The total size of WebSocket messages received in bytes.

Name: hasura_websocket_messages_received_bytes_total
Type: Counter
Labels: none
Unit: bytes
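
As a sketch (the 5-minute window is chosen arbitrarily), the average WebSocket throughput in bytes per second can be derived from these counters:

  # Average WebSocket egress in bytes per second over the last 5 minutes
  sum(rate(hasura_websocket_messages_sent_bytes_total[5m]))

  # Average WebSocket ingress in bytes per second over the last 5 minutes
  sum(rate(hasura_websocket_messages_received_bytes_total[5m]))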

WebSocket Message Queue Time

The time for which a WebSocket message remains queued in the GraphQL Engine's WebSocket queue.

Name: hasura_websocket_message_queue_time
Type: Histogram
Buckets: 0.000001, 0.0001, 0.01, 0.1, 0.3, 1, 3, 10, 30, 100
Labels: none
Unit: seconds

WebSocket Message Write Time

The time taken to write a WebSocket message into the TCP send buffer.

Name: hasura_websocket_message_write_time
Type: Histogram
Buckets: 0.000001, 0.0001, 0.01, 0.1, 0.3, 1, 3, 10, 30, 100
Labels: none
Unit: seconds

Cache metrics

See the caching metrics documentation for more details.

Hasura cache request count

Total number of cache hit and miss requests. This helps in monitoring and optimizing cache utilization.

Name: hasura_cache_request_count
Type: Counter
Labels: status: hit | miss
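
For example, a cache hit ratio can be derived from this counter (a sketch; the 5-minute window is arbitrary):

  # Fraction of cache requests that were hits over the last 5 minutes
  sum(rate(hasura_cache_request_count{status="hit"}[5m]))
    /
  sum(rate(hasura_cache_request_count[5m]))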

Cron trigger metrics

Hasura cron events invocation total

Total number of invocations made for cron events.

Name: hasura_cron_events_invocation_total
Type: Counter
Labels: status: success | failed

Hasura cron events processed total

Total number of cron events processed. Compare this to hasura_cron_events_invocation_total; a large difference between the two metrics indicates a high failure rate of the cron webhook.

Name: hasura_cron_events_processed_total
Type: Counter
Labels: status: success | failed
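
A sketch of the comparison suggested above (the 1-hour window is arbitrary); a result well above 1 means cron events are being retried, i.e. the cron webhook is failing often:

  # Webhook invocations per processed cron event over the last hour
  sum(increase(hasura_cron_events_invocation_total[1h]))
    /
  sum(increase(hasura_cron_events_processed_total[1h]))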

One-off Scheduled events metrics

Hasura one-off events invocation total

Total number of invocations made for one-off scheduled events.

Name: hasura_oneoff_events_invocation_total
Type: Counter
Labels: status: success | failed

Hasura one-off events processed total

Total number of one-off events processed. Compare this to hasura_oneoff_events_invocation_total; a large difference between the two metrics indicates a high failure rate of the one-off webhook.

Name: hasura_oneoff_events_processed_total
Type: Counter
Labels: status: success | failed

Hasura HTTP connections

Current number of active HTTP connections (excluding WebSocket connections), representing the HTTP load on the server.

Name: hasura_http_connections
Type: Gauge
Labels: none

Hasura WebSocket connections

Current number of active WebSocket connections, representing the WebSocket load on the server.

Name: hasura_websocket_connections
Type: Gauge
Labels: none

Hasura Postgres connections

Current number of active PostgreSQL connections. Compare this to your connection pool settings.

Name: hasura_postgres_connections
Type: Gauge
Labels:
  • source_name: name of the database
  • conn_info: connection URL string (password omitted) or name of the connection URL environment variable
  • role: primary | replica

Hasura Postgres Connection Initialization Time

The time taken to establish and initialize a PostgreSQL connection.

Name: hasura_postgres_connection_init_time
Type: Histogram
Buckets: 0.000001, 0.0001, 0.01, 0.1, 0.3, 1, 3, 10, 30, 100
Labels:
  • source_name: name of the database
  • conn_info: connection URL string (password omitted) or name of the connection URL environment variable
  • role: primary | replica
Unit: seconds

Hasura Postgres Pool Wait Time

The time taken to acquire a connection from the pool.

Name: hasura_postgres_pool_wait_time
Type: Histogram
Buckets: 0.000001, 0.0001, 0.01, 0.1, 0.3, 1, 3, 10, 30, 100
Labels:
  • source_name: name of the database
  • conn_info: connection URL string (password omitted) or name of the connection URL environment variable
  • role: primary | replica
Unit: seconds
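
As a sketch (standard _bucket series assumed; the 5-minute window and 0.95 quantile are arbitrary), pool contention per database and role can be watched with:

  # Approximate p95 time spent waiting for a connection from the pool
  histogram_quantile(
    0.95,
    sum by (le, source_name, role) (rate(hasura_postgres_pool_wait_time_bucket[5m]))
  )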

Hasura source health

Health check status of a particular data source, corresponding to the output of /healthz/sources, with possible values 0 through 3 indicating, respectively: OK, TIMEOUT, FAILED, ERROR. See the Source Health Check API Reference for details.

Name: hasura_source_health
Type: Gauge
Labels: source_name: name of the database
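
Since 0 means OK, a simple sketch of an alert-style expression is:

  # Returns the data sources that are currently not healthy
  # (1 = TIMEOUT, 2 = FAILED, 3 = ERROR)
  hasura_source_health > 0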

HTTP Egress

Total size of HTTP response bodies sent via the HTTP server, excluding responses to the /healthz and /v1/version endpoints and to any undefined resource/endpoint (for example, /foobar).

Name: hasura_http_response_bytes_total
Type: Counter
Labels: none
Unit: bytes

HTTP Ingress

Total size of HTTP request bodies received via the HTTP server, excluding requests to the /healthz and /v1/version endpoints and to any undefined resource/endpoint (for example, /foobar).

Name: hasura_http_request_bytes_total
Type: Counter
Labels: none
Unit: bytes

OpenTelemetry OTLP Export Metrics

These metrics allow for monitoring the reliability and performance of OTLP exports of telemetry data.

Hasura OTLP Sent Spans

Total number of successfully exported trace spans.

Name: hasura_otel_sent_spans
Type: Counter
Labels: none

Hasura OTLP Dropped Spans

Total number of trace spans dropped due to either high trace volume that filled the buffer, or errors during send (e.g. a timeout or error response from the collector).

Name: hasura_otel_dropped_spans
Type: Counter
Labels: reason: buffer_full | send_failed
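
As a sketch (5-minute window chosen arbitrarily), the drop rate broken down by reason can be watched with:

  # Trace spans dropped per second, by reason, over the last 5 minutes
  sum by (reason) (rate(hasura_otel_dropped_spans[5m]))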