How we use the Dataloader pattern for optimising GraphQL Joins at Hasura
Unlike joins across two tables in the same Postgres database, all “remote” joins must be performed by the GraphQL Engine itself, regardless of whether the data originates from a database or a remote schema: each source is independent of the others, and only the Engine has full knowledge of all of them. To perform such a join, we use a “dataloader” approach: we analyse the query ahead of its execution to identify its individual parts, and we modify each sub-query to also extract the required join keys; the engine then issues the sub-queries one by one and stitches the results together. For instance, consider the following GraphQL query:
query {
  album {
    name
    artist {
      name
    }
  }
}
Here, album lives in a local database, artist lives in a remote schema, and the relationship between them is defined such that album.artist_id is the argument to artist. The engine identifies two steps to this process: first, call the local database to get the data from album and extract the relevant join keys; then, perform a subsequent call to the remote schema. In practice, we would first issue a SQL query that looks like this:
SELECT
  name,
  artist_id AS artist_id_join_col_1,
  'placeholder' AS artist
FROM
  album;
Internally, we keep track of which columns were added solely for the purpose of retrieving join keys, and must therefore be removed from the result: we call those phantom columns. Additionally, we add a placeholder where the result of the join will go, which preserves the order of fields as originally requested in the GraphQL query without requiring additional internal bookkeeping.
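To make this concrete, here is a minimal sketch of that post-processing, with hypothetical names rather than the engine's actual code: for each row, we drop the phantom columns and overwrite the placeholder with the joined value.

import qualified Data.Map.Strict as M
import Data.Text (Text)

-- A row from the first step, keyed by column name; 'v' stands in for a JSON value.
type Row v = M.Map Text v

-- Drop the phantom join columns, and overwrite the placeholder column with
-- the result of the remote join.
stripPhantomsAndSplice :: [Text] -> Text -> v -> Row v -> Row v
stripPhantomsAndSplice phantomCols joinField joinedValue row =
  M.insert joinField joinedValue (foldr M.delete row phantomCols)

For the query above, phantomCols would be ["artist_id_join_col_1"] and joinField would be "artist".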
We can then issue a subsequent query to the remote schema for each artist; this time, there is no need for placeholders or join columns, as this is the last step of the process:
query ($hasura_var_1: Int!) {
  artist(id: $hasura_var_1) {
    name
  }
}
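The wiring between the two steps is straightforward. As a hedged sketch (the function and its shape are illustrative; only the column and variable names mirror the examples above), it could look like this:

{-# LANGUAGE OverloadedStrings #-}
import Data.Maybe (mapMaybe)
import Data.Text (Text)

-- For each row returned by the SQL step, the value extracted through the
-- phantom column becomes that row's $hasura_var_1 in the follow-up query.
variableSets :: [[(Text, v)]] -> [[(Text, v)]]
variableSets = mapMaybe toVariables
  where
    toVariables row = do
      artistId <- lookup "artist_id_join_col_1" row
      pure [("hasura_var_1", artistId)]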
This is how remote joins have always worked; but extending them to work across databases (and, more recently, across remote schemas) has made the process more complicated: there might now be nested joins, where the result of a previous join must itself contain the join keys for the following one. The overall logic remains the same; however, we now internally construct a full join tree, a recursive representation of the steps to be performed. This structure is the union of all paths within the first query that will require a join; each node of the tree may, in turn, contain a subsequent join tree, for the joins that must be performed on that node's result. For a more complicated query, such a tree might look like this:
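In simplified form (a hypothetical type, not the engine's actual definition, which is linked at the end of this post), it could be written as:

{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)

-- A join tree node: the field whose value comes from another source, and
-- the joins that must in turn be performed on that join's own result.
data JoinNode = JoinNode
  { joinField   :: Text
  , nestedJoins :: [JoinNode]
  }

-- The tree described below: two joins off of the artist object, the second
-- of which (picture) carries a nested join to fetch the picture's metadata.
exampleTree :: [JoinNode]
exampleTree =
  [ JoinNode "album" []
  , JoinNode "picture" [ JoinNode "metadata" [] ]
  ]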
From the artist object, we perform two remote joins: one to get the album, and one to get a picture; after performing that second remote join, we perform another one, from the picture to its metadata: in our join tree, the picture node contains another, nested join tree.
In the engine, the joins are no longer a single step: they are now performed iteratively. After executing the very first step of the query, we traverse the join tree; at each node, that is, each join that needs to be performed, we execute that join and repeat the process recursively with that node's sub-tree. We perform, in essence, a depth-first traversal, constructing the resulting JSON object on the way back up.
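Reusing the JoinNode sketch above, and assuming two hypothetical helpers (runJoin, which issues the sub-query for one join given the parent result, and splice, which writes the joined value back into it), the traversal could look roughly like this:

import Control.Monad (foldM)
import Data.Text (Text)

-- 'runJoin' and 'splice' are hypothetical stand-ins, not engine functions.
processJoins
  :: Monad m
  => (JoinNode -> result -> m result)     -- runJoin: execute one join step
  -> (Text -> result -> result -> result) -- splice field child parent
  -> [JoinNode]
  -> result
  -> m result
processJoins runJoin splice joins parent = foldM step parent joins
  where
    step acc node = do
      child  <- runJoin node acc
      -- depth-first: resolve this join's own nested joins before splicing
      child' <- processJoins runJoin splice (nestedJoins node) child
      pure (splice (joinField node) child' acc)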
However, when extending the Engine to allow joins from remote schemas, we encountered an additional difficulty, which forced us to refine how our join trees are defined: the path at which a join must be performed can be ambiguous! Consider, for instance, the following query:
query ($nodeId: Int!) {
  node(id: $nodeId) {
    ... on Artist {
      articles {
        name
      }
    }
    ... on Writer {
      articles {
        name
      }
    }
  }
}
Depending on the actual GraphQL type of the object being returned, we must perform one of two possible joins: either one from artist to articles, or one from writer to articles. But, in both cases, the part of the response object that needs to be modified is at the path node.articles! To differentiate between the two, we need to distinguish them in the join tree, and furthermore the engine must know what the actual GraphQL type of the returned object is.
To do so, we have changed our join tree definition to allow each branch to be annotated with a typename. In this case, our join tree would have two joins: one at the path node.(Artist.articles), and one at the path node.(Writer.articles). We only do this for paths that depend on the runtime GraphQL type. And to retrieve the runtime type, we add a phantom field to the request that extracts the __typename of the object, which we then use when traversing the join tree.
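Continuing the earlier sketch (again with hypothetical types, not the actual definition linked below), a path segment can now carry an optional type condition, which is enough to tell the two joins apart:

{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)

-- A path segment in the join tree: a field name, optionally qualified by
-- the GraphQL type it applies to when the join depends on the runtime type.
data QualifiedField = QualifiedField
  { typeCondition :: Maybe Text  -- e.g. Just "Artist"
  , fieldName     :: Text
  }

-- The two joins from the example above, now distinguishable:
artistArticles, writerArticles :: [QualifiedField]
artistArticles = [QualifiedField Nothing "node", QualifiedField (Just "Artist") "articles"]
writerArticles = [QualifiedField Nothing "node", QualifiedField (Just "Writer") "articles"]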
To go further:
- Our join trees are very similar to tries, also known as prefix trees, with the difference that we only store data in the leaves: https://en.wikipedia.org/wiki/Trie
- The definition of this join tree in our code, with examples: https://github.com/hasura/graphql-engine/blob/8b0b4e5c35d1628c1faa3986fe50598deba33e75/server/src-lib/Hasura/GraphQL/Execute/RemoteJoin/Types.hs#L44