Introducing Aggregation Queries for Computed Fields

The Hasura team has recently been digging back into GraphQL Engine's query parsers and SQL generators while working on implementing the new logical models feature. While we were there, we also managed to tick off a long-requested community feature: aggregation queries for computed fields. Let's dig into the problem and how we've solved it!

What's the problem?

Hasura has supported the idea of aggregation queries for a long time. As a quick refresher, consider an articles table, which we'd normally query like so:

query {
  articles {
    title
    content
    rating
  }
}

With aggregation queries, we can ask questions about the data set as a whole:

query {
  articles_aggregate {
    aggregate {
      count # How many articles have been written?
      avg {
        rating
      } # What's the average rating for all articles?
      min {
        rating
      } # What's the lowest rating for any article?
    }
  }
}

There are two types of aggregation queries: those that are numeric (such as sum and avg), and those that are comparable (max and min). Up until now, these queries could be applied respectively to any field with a type that is considered numeric or comparable, unless that field happened to be a computed field.

Computed fields allow us to add extra fields to a GraphQL schema node using custom SQL functions. These functions receive a result row as input, calculate an output, and then that output is added to the result. For example, consider the following function that computes the word count of a given article:

CREATE FUNCTION word_count(article_row articles)
RETURNS BIGINT AS $$
SELECT ARRAY_LENGTH(string_to_array(article_row.content, ' '), 1)
$$ LANGUAGE sql STABLE;

We can add this as a computed field to our articles table, and query it just like any other field:

query {
  articles {
    title
    content
    rating
    word_count
  }
}

This would all work perfectly well, but things would get complicated when we wanted to know the average word count of all our articles, for example. One way would be to query the word_count for every row in the entire table and compute the average client-side, but that quickly becomes intractable as our tables grow to thousands of rows!

Another way to work around this is to add an extra column to the articles table for word_count, and add a database trigger to compute the character count. This has a few drawbacks:

It requires a table schema change. This means a migration, potentially having to rework queries throughout your application, updating Hasura permissions, and so on. It may also mean asking a database administrator to add a ticket to their backlog, which all contributes to a slower development cycle.
There's quite a lot of room for error. For a simple computation like this one, we just need to remember to recompute the word count whenever we INSERT or UPDATE. However, plenty of factors can make it much more difficult to determine when to recompute the values. What if we require information from multiple tables? What if we compute something based on the current time?
We have to start this whole process over again when we're asked for another aggregation! It's easy to end up with a fragile tower of triggers once we start needing several different aggregation queries.

If only we could aggregate computed fields

Clearly, it would be better to compute the value of the computed field when we require it with the current state of our application data, and that's what aggregation queries over computed fields give us:

query {
  articles_aggregate {
    aggregate {
      avg {
        word_count
      } # What's the average word count for all articles?
      min {
        word_count
      } # What's the shortest word count of any article?
    }
  }
}

As of Hasura v2.27 for Postgres, computed fields can be aggregated just like any other fields!

How does it work?

Parsing computed fields in aggregation queries

Let's briefly go through the lifecycle of a GraphQL Engine query. When we make a request, this request is parsed by a parser built specifically for your schema and metadata configuration. This means that, for example, you can only query a table that exists (and you're permitted to see). Otherwise, it's a syntax error - the parser doesn't recognise your requested table as a valid table name!

Consequently, the first step to implementing aggregation queries on computed fields is extending the aggregation query parser to recognise computed fields as something we know how to aggregate. Now, as well as looping over a table's columns to determine which columns can be aggregated, we also loop over the computed fields. At this stage, we're not generating any SQL: we're simply declaring what can and cannot be aggregated. In the case of computed fields, we can now aggregate any fields whose types are a comparable (in the case of min and max) or numeric (in the case of avg, sum, and so on) scalar type. Computed fields that return table-like types cannot be aggregated.

Intermediate representation

With the parser extended, we can make an aggregation query on a computed field without a syntax error, and we can move on to interpreting the query into something useful. GraphQL Engine has its own intermediate representation (IR) outputted by the schema parser. This is an internal data structure whose core describes the query we want to make in a backend-agnostic language. It also includes relevant user configuration such as permission filters, whether to stringify numbers, and so on.

As with the parsers, the first step of the change is to extend the IR to allow computed fields to appear in aggregation operations. In practice, this means adding another constructor to the type of allowed fields in aggregations:

data SelectionField -- Simplified for illustration purposes
	= SFColumn ColumnName ColumnType
+   | SFComputedField ComputedFieldName ComputedFieldSelection
	| SFExp Text

One of the joys of working with Haskell (or any language with exhaustive pattern matching) is that, at this point, the compiler will find every place where SelectionField values are used that now need to add a case for computed fields, and the rest of the work becomes a matter of fixing compiler errors. After fixing a few dozen of those, all that's left is to generate the SQL!

IR to SQL

At this point, we're still just following the compiler's breadcrumbs, and we find our way to the aggregation query generator. Let's say we make a request for the average rating of all articles. The generated query probably ends up looking a bit like this:

SELECT 
  -- (3)
  json_build_object(
    'aggregate', 
    json_build_object(
      'avg', 
      json_build_object(
        'rating', 
        avg("rating")
      )
    )
  ) AS "root" 
FROM 
  (
    SELECT 
      -- (2)
      "_root.base"."rating" AS "rating" 
    FROM 
      (
        SELECT 
          -- (1)
          * 
        FROM 
          "public"."article" 
        WHERE 
          ('true')
      ) AS "_root.base"
  ) AS "_root"

This is the anatomy of an aggregation query:

First, we query the table for the rows we wish to aggregate, which we name _root.base. This would also be where filters (such as permissions) are applied.
From that query, we select the fields we want to aggregate (internally, we refer to this step as applying the extractors), which we call _root.
Finally, we call the aggregation functions with the extracted fields in _root, producing the eventual response object as root.

In order to handle computed fields, we need to compute the result of the computed field before we call the aggregation functions in step (3). Because computed fields take the whole row as an input, most of our new complexity sits in step (2): if any of the fields we're querying are computed fields, then we need to extract the entire row, rather than individual fields. If none of the fields we're querying are computed fields, then we preserve the old behaviour.

The new behaviour can be seen in a query that uses both a regular column and a computed field:

SELECT 
  -- (3)
  json_build_object(
    'aggregate', 
    json_build_object(
      'avg', 
      json_build_object(
        'rating', 
        avg("rating"), 
        'word_count', 
        avg(
          "public"."word_count"("_root")
        )
      )
    )
  ) AS "root" 
FROM 
  (
    SELECT 
      -- (2)
      "_root.base".* AS "entire_table_row" 
    FROM 
      -- (1)
      (
        SELECT 
          * 
        FROM 
          "public"."articles" 
        WHERE 
          ('true')
      ) AS "_root.base"
  ) AS "_root"

As we can see, step (2) now extracts the entire row for word_count, allowing us to aggregate computed fields, while also not breaking aggregates over regular fields. With the SQL generation done, it's up to Postgres to execute the query and return some results!

Wrapping up

Thanks to some recent work in the schema parsers and query generators, this ended up being a very satisfying change to make. Aggregation queries on computed fields will be available in v2.27 for Postgres, and we're excited to see what you do with them!