Observability & Performance Tuning
Hasura Event Triggers Execution
The Hasura Event Triggers system can be segmented into the 2 parts:
Event capture system
Event capture is accomplished via database triggers. A database trigger is created which is invoked whenever there is an
INSERT/UPDATE/DELETE (based on the definition of the event trigger) on the table.
The database trigger captures a per-row change and then writes that to a Hasura Events table. The Hasura Event tables acts as a queue for all pending/in-process events.
Event delivery system
Hasura creates a poller thread, which polls the Hasura Event tables for new/pending events. The poller thread fetches the events in batches (by default 100) and adds them to its in-memory events queue (Hasura Events queue). The polling is paused if all the HTTP workers (defined below) are busy.
Hasura also creates a pool of HTTP workers (by default 100) which are responsible for delivering the events from the events queue to the webhook.
Hasura exposes a set of Prometheus metrics that can be used to monitor the Event Trigger system and help diagnose performance issues.
Event fetch time per batch
Hasura fetches the events in batches (by default 100) from the Hasura Event tables in the database. This metric represents the time taken to fetch a batch of events from the database.
A higher metric indicates slower polling of events from the database, you should consider looking into the performance of your database.
Buckets: 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10
Event invocations total
This metric represents the number of HTTP requests that have been made to the webhook server for delivering events.
Event processed total
Total number of events processed. Represents the Event Trigger egress.
Event processing time
Time time taken for an event to be processed.
Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100
The processing of an event involves the following steps:
- Hasura Engine fetching the event from Hasura Event tables in the database and adding it to the Hasura Events queue
- An HTTP worker picking up the event from the Hasura Events queue
- An HTTP worker delivering the event to the webhook
Note, if the delivery of the event fails - the delivery of the event is retried based on its
This metric represent the time taken for an event to be delivered since it was created (if the first attempt) or retried (after the first attempt). This metric can be considered as the end-to-end processing time for an event.
For e.g., say an event was created at
2021-01-01 10:00:30 and it has a
next_retry_at configuration which says if the
event delivery fails, the event should be retried after 30 seconds.
2021-01-01 10:01:30: the event was fetched from the Hasura Event tables, picked up by the HTTP worker, and the
delivery was attempted. The delivery failed and the
2021-01-01 10:02:00 was set for the event.
2021-01-01 10:02:00: the event was fetched again from the Hasura Event tables, picked up by the HTTP worker,
and the delivery was attempted at
2021-01-01 10:03: 30. This time, the delivery was successful.
The processing time for the second delivery try would be:
Processing Time = event delivery time - event next retried time
Processing Time =
2021-01-01 10:03:30 -
2021-01-01 10:02:00 =
Event queue time
Hasura fetches the events from the Hasura Event tables in the database and adds it to the Hasura Events queue. The event queue time represents the time taken for an event to be picked up by the HTTP worker after it has been added to the "Events Queue".
Higher value of this metric implies slow event processing. In this case, you can consider increasing the HTTP pool size or optimizing the webhook server.
Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100
Event Triggers HTTP Workers
Current number of active Event Trigger HTTP workers. Compare this number to the HTTP pool size. Consider increasing it if the metric is near the current configured value.
Event webhook processing time
The time between when an HTTP worker picks an event for delivery to the time it sends the event payload to the webhook.
A higher processing time indicates slow webhook, you should try to optimize the event webhook.
Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10
Events fetched per batch
Number of events fetched from the Hasura Event tables in the database per batch. This number should be equal or less than the events fetch batch size.
Since polling the database to continuously check if there are any pending events is an expensive operation, Hasura only polls the database if there are any pending events. This metric can be used to understand if there are any pending events in the Hasura Event Tables.
Note that Hasura only fetches events from the Hasura Event tables if there are any pending events. If there are no pending events, this metric will be 0.
Golden signals for Hasura Event Triggers
You can perform Golden Signals-based system monitoring with Hasura's exported metrics. The following are the golden signals for analyzing Hasura Event Triggers system performance.
Latency for the Hasura Event Triggers system references the total time taken by the graphql engine in delivering the
events. To monitor the latency, you can use the
If the value of this metric is high, it maybe an indication that events are taking longer time to be processed and delivered.
The following are few things you can do to analyze and diagnose the latency issue:
- Investigate DB performance: Check the value of the
hasura_event_fetch_time_per_batch_secondsmetric. If the value of this metric is high, it maybe an indication that the database is slow. Consider optimizing the database.
- Investigate HTTP Worker Saturation: Check the value of the
hasura_event_queue_time_secondsmetric. If the value of this metric is high, it maybe an indication that all the HTTP workers are saturated and are not able to pick up the events from the
Events Queue. Consider increasing the Events HTTP Pool Size
- Investigate Webhook Performance: Check the value of the
hasura_event_webhook_processing_time_secondsmetric. If the value of this metric is high, it maybe an indication that the webhook is slow. Consider optimizing the webhook.
You can also try scaling your Hasura instance horizontally to handle more events.
Saturation is the threshold until which the Hasura Event Triggers system can run smoothly. Once this threshold is crossed, you may see performance issues such as high event processing time, etc.
Saturation for the Event Trigger system references the difference between the incoming events rate and event delivery rate.
Saturation = No. of active HTTP workers / No. of total HTTP workers
To monitor saturation, you can use the following:
- Investigate HTTP Workers: Compare the active HTTP workers
hasura_event_trigger_http_workerswith the Events HTTP Pool Size. Saturation is high if the active HTTP workers is near the HTTP pool size. If HTTP workers are saturated then it maybe also indicate that the
hasura_event_queue_time_secondsis also high. Consider Increasing the number of HTTP workers by increasing the Events HTTP Pool Size
Traffic for Event Triggers means the number of new events created at a given point of time. Since it's complicated to figure out the number of events created, you can use the number of Event Triggers processed as a proxy for traffic.
To monitor traffic, you can use the
If the value of this metric is high (and above your established baseline), and the Hasura Event Triggers system is also
hasura_event_trigger_http_workers nearing the configured HTTP worker pool size and
hasura_event_queue_time_seconds is also high), then you may want to consider doing the following:
- Increasing the number of HTTP workers by increasing the Events HTTP Pool Size
- Scaling your Hasura instance horizontally to handle more events.
Errors for an Event Trigger references the number of event deliveries that failed or errored out.
To monitor errors, you can use the
hasura_event_processed_total metric. You can then filter
the metric using the label
status: failed i.e.,
You can do to the following to analyze and diagnose errors:
- Identify the Event Trigger with a high error rate. You can do this by using the above metric and see the trigger
name associated to that metric in the
- You can then use the
*_get_event_invocation_logsmetadata API to get the error logs for the Event Trigger. This should provide some insights into the error.
Tuning Hasura Event Triggers performance
Hasura Event Triggers are designed to handle of millions of events per hour. However, due to misconfigurations or other reasons, the performance of the Hasura Event Triggers system can be impacted. This section describes how to tune the performance of subscriptions.
Event Trigger processing can be tuned by few server settings as described below:
- The number of events fetched from the
Hasura Event tablesin the database per batch. By default, 100.
- Increasing this will fetch more events from the database per batch, thereby reducing the load on database and improving throughput while increasing individual fetch SQL execution times and, potentially, the memory of the Hasura instance.
- The number of events fetched from the
- The maximum number of HTTP workers that are spawned to deliver events to the webhook. By default, 100.
- Increasing this will spawn more HTTP workers, thereby increasing the number of concurrent event deliveries to the webhook. This may also increase the memory and the CPU usage of the Hasura instance.
- The interval at which Hasura polls the database for new events. By default, 1 second.
- Increasing this reduces frequency of the poll to the database reducing the load on it while increasing the latency of processing of event.