New observability dashboard for Hasura Cloud

As Hasura Cloud continues to evolve, so do the needs of our users. With thousands of projects relying on Hasura’s powerful GraphQL engine, the demand for a more scalable and robust observability system has become increasingly apparent. After years of using our original monitoring system, we’re excited to introduce a new observability architecture designed to provide better insight into project performance, simplify troubleshooting, and address the growing challenges of scaling.

In this blog post, we’ll explore the features and limitations of the current monitoring setup, how we’re leveraging Grafana, Prometheus, and Hasura’s Data Delivery Network (DDN) to offer a more comprehensive solution, and what this means for the future of monitoring in Hasura Cloud v2.

The challenges with the existing monitoring system

The current monitoring system for Hasura Cloud v2 has been in use for over five years, dating back to the introduction of the enterprise version of GraphQL Engine v1. It derives metrics primarily from ingested logs and stores them in TimescaleDB v1, which we chose because it is PostgreSQL-compatible and therefore works with the GraphQL Engine. The advantage of this setup was that we didn’t need to write APIs by hand, as Hasura generated them for us.

However, over the past five years, as Hasura Cloud has grown, several scalability issues have emerged with the existing system:

TimescaleDB is hard to scale horizontally. Our system has to ingest 100+ terabytes of logs every day, and disk I/O can’t keep up with such a write-intensive workload, especially with large JSON log entries.
Timescale’s continuous materialized views compute aggregate metrics through background jobs that run on a schedule, so aggregate metrics lag behind the raw data. Although the database supports real-time aggregation, running it over millions of raw log lines is compute-intensive.
Dropping large hypertable chunks often causes problems because it relies on PostgreSQL’s autovacuum process, and we frequently run low on disk space as a result. This forces us to manage old data manually, adding operational overhead.
Timescale v2 is supposed to solve many of these issues, but migrating to it is not straightforward.

We therefore set out to find an alternative on which to build a new observability system that helps customers monitor and troubleshoot their projects more easily.

Why Grafana + Prometheus?

When searching for a replacement, Grafana and Prometheus immediately stood out as viable solutions.

Native Prometheus metrics were introduced in GraphQL Engine v2.13 and have been improved continuously since then. The latest version exposes 30+ metrics covering all available features of the engine.
Grafana’s powerful visualization capabilities allow us to integrate multiple data sources (including Prometheus) and offer customizable dashboards. It also supports embedding dashboards into other web applications, making it a flexible solution.
We're already ingesting all of these metrics into Google Managed Prometheus for our internal observability and SRE use cases.

Prometheus is great, but…

While Prometheus is a great choice for metric storage and querying, its native data source in Grafana falls short for our use case. We need to ensure that Hasura Cloud users can only view metrics for the projects they own or collaborate on, which Prometheus cannot handle natively. One potential workaround is to create a separate Grafana instance for each organization, but this would be too costly to scale.
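
To make the gap concrete, here is a minimal Python sketch of a range query against the standard Prometheus HTTP API endpoint `/api/v1/query_range`. The base URL, the metric name `hasura_graphql_requests_total`, and the `project_id` label are assumptions chosen for illustration, not details of our production setup. The point is that the request carries only a PromQL expression and a time window. Nothing in it identifies the caller, so any per-project scoping has to be enforced by a trusted layer in front of Prometheus.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url: str, promql: str, start: int, end: int, step: str) -> str:
    """Build a URL for the Prometheus /api/v1/query_range endpoint."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Hypothetical host, metric, and label names, used only for illustration.
url = build_range_query_url(
    "http://prometheus.example.internal:9090",
    'sum(rate(hasura_graphql_requests_total{project_id="project-a"}[5m]))',
    start=1700000000,
    end=1700003600,
    step="60s",
)

# The caller picks the project_id matcher freely: swapping in "project-b"
# produces an equally valid request, because Prometheus itself does not
# check who owns which project.
print(url)
```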

Enter the Hasura Prometheus connector

While Grafana itself doesn’t offer a built-in GraphQL data source, the community has developed GraphQL plugins, such as the Wild GraphQL Data Source, which can forward cookies and OAuth2 tokens to allow remote GraphQL services to handle authorization. With community-driven support for GraphQL in Grafana, we decided to leverage Hasura’s GraphQL Engine to manage the complex permissions required for Hasura Cloud projects.

To achieve this, we built a Prometheus connector for Hasura, allowing us to integrate project-specific observability with Hasura’s existing permissions system. We then deployed this as part of a new Data Delivery Network (DDN) supergraph, which utilizes the Prometheus connector to deliver metrics in a scalable and secure manner for individual projects.

New observability system architecture

The new architecture integrates Grafana, the Wild GraphQL Data Source plugin, and Hasura’s DDN to enable granular, project-specific observability for Hasura Cloud users.

Architecture of new observability system

Grafana, using the Wild GraphQL Data Source plugin, forwards the user’s browser cookies to DDN.
Hasura’s engine authenticates the user’s credentials with the control plane and retrieves the necessary session variables for project-based permissions.
The engine applies these permissions, then forwards the query to the Prometheus connector to fetch and display the relevant metrics.
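
The flow above can be sketched in a few lines of Python. This is a simplified in-memory model, not the actual engine or connector code: the session table, the `x-hasura-project-ids` session variable, and the sample data are all hypothetical stand-ins. It assumes the permission behaves like a row filter, so a user who asks for a project they cannot access simply gets no rows for it.

```python
from dataclasses import dataclass

# Hypothetical control-plane data: session variables keyed by browser cookie.
SESSIONS = {
    "cookie-alice": {"x-hasura-user-id": "alice", "x-hasura-project-ids": ["project-a"]},
    "cookie-bob": {"x-hasura-user-id": "bob", "x-hasura-project-ids": ["project-b", "project-c"]},
}


@dataclass(frozen=True)
class Sample:
    project_id: str
    timestamp: int
    value: float


# Stand-in for the data the Prometheus connector would return.
METRICS = [
    Sample("project-a", 1700000000, 12.0),
    Sample("project-b", 1700000000, 7.0),
    Sample("project-c", 1700000000, 3.0),
]


def authenticate(cookie: str) -> dict:
    """Resolve the forwarded cookie to session variables (step 2)."""
    session = SESSIONS.get(cookie)
    if session is None:
        raise PermissionError("unknown or expired session")
    return session


def fetch_metrics(cookie: str, requested_project_ids: list) -> list:
    """Authenticate, apply the permission filter, then query (steps 2 and 3)."""
    session = authenticate(cookie)
    # Only projects that are both requested and permitted survive the filter.
    visible = set(requested_project_ids) & set(session["x-hasura-project-ids"])
    return [s for s in METRICS if s.project_id in visible]


# Alice asks for two projects but only sees the one she has access to.
print(fetch_metrics("cookie-alice", ["project-a", "project-b"]))
```

Keeping the filter in the engine rather than in the dashboard means the browser never needs to be trusted to request only its own projects.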

A sample GraphQL query with results plotted on Grafana

Grafana + DDN + Prometheus = ♥️

With the new observability dashboard integrated into the Monitoring tab of each project’s detail page, users now have an intuitive, powerful way to monitor their projects. This is just the beginning: Hasura’s DDN architecture has proven flexible and extensible, allowing for integration with other databases such as PostgreSQL, ClickHouse, MongoDB, and more.

Hasura DDN + Grafana + Prometheus equals a match made in observability heaven, and we’re excited to share more updates as we continue to push the boundaries of what’s possible with Hasura DDN.

Read more about this dashboard in our docs.