Best Practices Guide for GraphQL Observability with Hasura [Part 1]

Hasura is an open-source product that accelerates API development by 10x by instantly giving you GraphQL or REST APIs on your data. Hasura plays a central role in your application stack as it can call out to your other services and databases while adding authorization and caching on top. Your development teams can quickly access critical data that powers their applications. Please see our website for more information.

Note: This is a multi-part series. The current post delves into the various Observability metrics relevant for apps built with GraphQL / REST APIs. In the next parts, we will talk about the use cases and how Hasura specifically solves them.

Observability

As your app grows and becomes more complex, it becomes increasingly difficult to understand how it is used and performs. This is where observability comes in. Observability is the ability to gauge a system’s internal conditions by looking at its outputs, such as metrics or logs. Observability means you can:

Gain insights into the functionality and health of your systems, collect data, visualize them, and set up alerts for potential problems.
Have distributed tracing provide end-to-end visibility into actual requests and code, helping you improve your application’s performance.
Audit, debug, and analyze logs from all your services, apps, and platforms at scale.

Because Hasura can act as a central gateway for your application stack, we take observability very seriously.

SLOs, SLIs, and the Three Pillars

The path to thoroughly observing any app is to ensure you have your SLOs defined, SLIs identified, and telemetry data that enables the collection of those SLIs. In production, your SRE team would define service level objectives (SLOs) that are intended to measure the reliability of your app. Each SLO is a target value or range of values for a service level measured by a service level indicator (SLI). For example, a service level indicator could be the number of requests per second, and an SLO could be that your app needs to be able to serve a minimum of ten thousand requests per second.
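As a rough illustration of the relationship between an SLI and an SLO, the sketch below measures requests per second over a window and compares it with the ten-thousand-requests-per-second objective from the example above. The `Slo` class and the function names are hypothetical and are not part of Hasura.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A target for one service level indicator (hypothetical structure)."""
    name: str
    minimum: float  # lowest acceptable SLI value


def requests_per_second(request_count: int, window_seconds: float) -> float:
    """Compute the SLI: requests served per second over a window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return request_count / window_seconds


def meets_slo(slo: Slo, sli_value: float) -> bool:
    """Return True when the measured SLI satisfies the objective."""
    return sli_value >= slo.minimum


# The SLO from the example: serve at least ten thousand requests per second.
throughput_slo = Slo(name="throughput", minimum=10_000)
```

In practice the request count would come from your metrics backend rather than being passed in by hand.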

Once our SLOs and SLIs are defined, we use the three pillars of observability to ensure we are meeting our objectives.

The Three Pillars of Observability

Observability relies on three key pillars: logs, metrics, and traces. Access to these components does not automatically make systems more visible, but when used together, they can provide powerful tools for understanding a system’s internal state.

Metrics are used to determine if there’s a problem to begin with. They provide snapshots of a system’s metadata, such as server resources used or requests per second. You can use metrics to monitor performance, identify trends, and set alerts for potential trouble.
Traces, used to find where the problem is, provide end-to-end visibility into a system’s operations by following requests through all the different services. You can use traces to see how a system is functioning, identify bottlenecks, and improve performance.
Logs, used to determine what the problem is, provide timestamped messages of events in a system. You can use them to debug issues.

Hasura Cloud and Enterprise come out of the box with logs, tracing, and the following metrics:

Request rate
Error rate
Average request execution time
Active subscriptions
Number of WebSockets open

All observability data is available via the Hasura Console, one of our APM integrations such as Datadog and New Relic, or any APM receiver that supports the OpenTelemetry specification.

With the traditional way of rolling your own GraphQL server, you have to manually set up all your resolver logic, observability, and authorization code.

Hasura Cloud and Enterprise come with support out of the box.

Metrics

Request Rate

The request rate represents how many requests a service receives per unit of time, such as 100 requests per minute (RPM). This is the most fundamental metric, as you must design your entire architecture around your expected request rate. As this metric grows, you must scale up your system with techniques such as database sharding or risk your system being overwhelmed and unreliable.

How to Use

Request rate tracks the overall usage of a system. It becomes too performance-intensive at high request rates to monitor every request. Instead, you would sample a percentage. Some other best practices:

Scale your services up and down dynamically based on the request rate. An example is Hasura Cloud’s automatic horizontal scaling.
If there are trends in RPM, such as more requests in the morning, you can pre-provision infrastructure.
When RPM spikes above expected usage, an alert should warn of a possible DoS attack.
High RPM from specific users may warrant setting up IP or auth-based rate limiting.
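The spike alert described above could be sketched as follows. This is a minimal illustration that assumes request timestamps are available as Unix seconds; the function names and the `factor` threshold are made up for the example and would need tuning against real traffic.

```python
from collections import Counter


def requests_per_minute(timestamps: list[float]) -> dict[int, int]:
    """Bucket request timestamps (Unix seconds) into per-minute counts."""
    return dict(Counter(int(ts // 60) for ts in timestamps))


def spike_minutes(rpm: dict[int, int], expected_rpm: int, factor: float = 3.0) -> list[int]:
    """Return the minute buckets whose count exceeds factor * expected_rpm."""
    threshold = expected_rpm * factor
    return sorted(minute for minute, count in rpm.items() if count > threshold)
```

The same per-minute counts can feed autoscaling decisions or per-user rate limits if the timestamps are grouped by user first.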

Error Rate

The error rate represents how many errors a service returns in a period of time. Most HTTP APIs signal errors via HTTP status codes: 4xx for client errors and 5xx for server errors. GraphQL differs because most errors are reported in the errors key of the response body. Hasura automatically parses both forms of errors for you.

Errors are an inherent part of distributed systems. Network communication is complex, and a lot can go wrong.

How to Use

Error rates are a good indicator of the overall health of your systems and are often the quickest way to spot problems. We can distinguish apparent disruptions over background noise by viewing the error rate over time and looking for patterns.

You can use the error rate in a couple of ways: monitor it during canary or blue-green deployments so that you can quickly roll back a bad service update, or trigger an alert whenever it rises above a baseline amount.
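A baseline check of this kind might look like the sketch below. The function names and the default `tolerance` are assumptions chosen for illustration; a real alerting rule would normally live in your APM system.

```python
def error_rate(error_count: int, request_count: int) -> float:
    """Fraction of requests in a window that ended in an error."""
    if request_count == 0:
        return 0.0
    return error_count / request_count


def should_alert(current_rate: float, baseline_rate: float, tolerance: float = 0.01) -> bool:
    """Alert when the current rate exceeds the baseline by more than tolerance."""
    return current_rate > baseline_rate + tolerance
```

During a canary rollout, the same comparison can be run with the stable version's error rate as the baseline and the canary's as the current rate.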

Viewing Individual Errors

For REST HTTP APIs, the HTTP status code tells us general information about the error. For example, HTTP 429 “indicates the user has sent too many requests in a given amount of time”. Based on the code, we can begin troubleshooting.

GraphQL does not use HTTP status codes unless it’s a server error. Instead, the response object has an errors key that you need to check for errors.
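To make the difference concrete, here is a small sketch of a client-side check that looks at both the HTTP status code and the errors key in the body. The function name is hypothetical; the only response fields it relies on are `errors` and each error's `message`.

```python
import json


def graphql_errors(status_code: int, body: str) -> list[str]:
    """Collect error messages from an HTTP status and a GraphQL response body."""
    messages = []
    if status_code >= 400:
        messages.append(f"HTTP {status_code}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return messages  # body is not JSON, so only the status code is usable
    if isinstance(payload, dict):
        for err in payload.get("errors") or []:
            if isinstance(err, dict):
                messages.append(err.get("message", "unknown error"))
    return messages
```

A response with status 200 can therefore still count toward the error rate if its body carries a non-empty errors list.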

Average Request Execution Time

The average request execution time is calculated as the total time taken by all requests during a time range, divided by the length of that time range in minutes.

How to Use

Latency is a huge factor in user experience, so as developers, we should try as hard as possible to reduce this metric. If your data allows it, caching built into Hasura is a significant first step in lowering execution time. Using tracing, which we learn about in an upcoming section, we can see remote APIs that need to be optimized or queries for which we may need to add an index.

Subscriptions

With Hasura GraphQL subscriptions, there are two main components: database subscriptions and the WebSocket connection to the client.

How to Use

Primarily useful to monitor performance, the subscription metrics allow you to see the usage of GraphQL subscriptions. You can see a list of clients connected via WebSocket. The database subscriptions feature helps you diagnose any database performance issue related to subscriptions. To understand the mapping between a WebSocket client and the underlying database subscription, please check out this article on how we scaled to one million GraphQL subscriptions.

Logs

Once metrics have identified an incident, you can drill into individual logs to find the root cause. Hasura has many logging options, such as choosing what layers emit logs, so please read the documentation. Here are a few ways we can use the Hasura logs to monitor our system:

Aggregating by Query Name

With typical REST HTTP APIs, we aggregate our statistics by the endpoint. Since GraphQL uses one endpoint for all requests, we should combine the security feature of allow lists with aggregating by operation name. This reduces our monitoring surface drastically because all queries are determined ahead of time.

This view is built into the Hasura Console but can be reproduced in any APM system using the data in the query-log layer.
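If you want to reproduce this aggregation yourself, a sketch along the following lines could work on exported JSON log lines. The field layout used here (`type`, `detail.query.operationName`) is an assumption about the log shape, so check it against the logs your Hasura version actually emits.

```python
import json
from collections import Counter


def count_by_operation(log_lines: list[str]) -> Counter:
    """Count query-log entries per GraphQL operation name."""
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict) or entry.get("type") != "query-log":
            continue
        # Assumed layout: the operation name lives at detail.query.operationName.
        query = (entry.get("detail") or {}).get("query") or {}
        counts[query.get("operationName") or "<unnamed>"] += 1
    return counts
```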

Diagnosing Errors

Requests to Hasura record success or error in the http-log layer. Once your metrics alert on an abnormal error rate, search your logs for entries with level: error. The response objects have an error key with the GraphQL error.
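As a minimal sketch of that search, the code below filters JSON log lines down to http-log entries with level error and then looks for the error key. Because the exact nesting of the error key is not spelled out here, the hypothetical `find_key` helper searches the whole entry rather than assuming a fixed path.

```python
import json


def error_entries(log_lines):
    """Yield parsed http-log entries whose level is "error"."""
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if (
            isinstance(entry, dict)
            and entry.get("type") == "http-log"
            and entry.get("level") == "error"
        ):
            yield entry


def find_key(node, key):
    """Depth-first search for the first value stored under key in nested dicts."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        for value in node.values():
            found = find_key(value, key)
            if found is not None:
                return found
    return None
```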

Distributed Tracing

A request can go through many services; therefore, it can be difficult to see where it failed or slowed down. Distributed tracing via the B3 spec combined with observability tooling allows you to follow a request through its life and pinpoint issues.

Hasura, acting as the API gateway, can trace external APIs and database queries.

External API Tracing

When a request comes into Hasura, it generates a B3 header if one doesn’t already exist. Whenever an API encounters a B3 header, it will report its request metrics to an APM system. The APM system then aggregates all the information to build a trace for each request, allowing you to visualize them efficiently.
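The "generate a B3 header if one doesn't already exist" step can be sketched roughly as below. This is not Hasura's implementation, and the function name is hypothetical. It uses the multi-header B3 names `X-B3-TraceId` and `X-B3-SpanId` and leaves any existing B3 context untouched.

```python
import secrets


def ensure_b3_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of headers that carries B3 trace context.

    An incoming trace is kept as-is; otherwise a new trace is started.
    """
    out = dict(headers)
    lowered = {name.lower() for name in headers}
    # B3 context may arrive as separate X-B3-* headers or as a single "b3" header.
    if "x-b3-traceid" in lowered or "b3" in lowered:
        return out
    out["X-B3-TraceId"] = secrets.token_hex(16)  # 128-bit trace ID, 32 hex chars
    out["X-B3-SpanId"] = secrets.token_hex(8)    # 64-bit span ID, 16 hex chars
    return out
```

Each downstream service forwards the trace ID unchanged, which is what lets the APM system stitch the individual reports into one trace.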

Database Tracing

There is no standard method for trace propagation in Postgres, so Hasura injects a transaction-local JSON variable.

To get the trace information into database logs and native database monitoring tools, query tags are used. Query tags are SQL comments appended to the generated SQL statements with tracing metadata. Tools like pganalyze can use them to help you optimize your queries.
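To show the general idea of a query tag, the sketch below appends key=value metadata to a SQL statement as a trailing comment. The comment format and the tag names are illustrative assumptions and may differ from what Hasura actually generates.

```python
def append_query_tags(sql: str, tags: dict[str, str]) -> str:
    """Append tracing metadata to a SQL statement as a trailing comment."""
    if not tags:
        return sql

    def clean(value) -> str:
        # Strip comment delimiters so a tag cannot break out of the comment.
        return str(value).replace("*/", "").replace("/*", "")

    body = ", ".join(
        f"{clean(key)}={clean(value)}" for key, value in sorted(tags.items())
    )
    return f"{sql} /* {body} */"
```

Because the tags travel inside the statement text, they show up in database logs and monitoring tools without any change to the query's behavior.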

Quickly and easily set up a GraphQL API and explore Hasura’s observability features for free on Hasura Cloud.

Summary

In conclusion, observability is a critical aspect of modern systems. By collecting and analyzing data from logs, metrics, and traces, you can gain valuable insights into the health and performance of your services. With the right tools and techniques offered out of the box with Hasura Cloud and Enterprise, you can improve the observability of your systems and ensure that they are functioning optimally.

Arjun Yelamanchili

06 Jan, 2023
