Refactoring the Core and Generalising the Future
At the end of February, we had the pleasure of announcing the first alpha release of Hasura GraphQL Engine version 2.0 - the culmination of over six months of engineering effort across the company. This release contains some large refactorings to the core of the Hasura product, as well as some large new feature additions. I’m going to talk about the software engineering tasks we tackled, challenges we encountered and how we met them.
I’d like to say a big congratulations and thank you to all of my colleagues who were involved in the release - it’s a great achievement, and it wouldn’t have been possible without everyone’s dedicated work.
Summary of changes
The 2.0 release includes several milestone changes in the core of the product:
- Switching to the new “Parse Don’t Validate” approach
- Separating the storage of Hasura’s metadata from the Postgres database
- Support for multiple Postgres backends
- Support for new kinds of relational database backend, starting with SQL Server
- Support for inherited roles
- Support for REST endpoints in addition to GraphQL (which Lyndon has already covered in his blog post here)
Despite this long list of features, 2.0 is still backwards compatible with previous releases (after an automated metadata upgrade), so you can update your Hasura deployments today. We have started the process of rolling out 2.0 upgrades on Hasura Cloud already, and cloud users can expect the upgrade at some point in the coming weeks.
We’ll take a look at each of the larger changes in turn, and the issues we faced while implementing them.
Parse, Don’t Validate!
The journey from version 1.3.3 to version 2.0 began in August last year, when we started the process of refactoring the core of our application logic to reflect the “parse, don’t validate” (or PDV) principle. The idea is outlined in Alexis’ excellent blog post, but I’ll give a simplified overview:
- When receiving unvalidated inputs, attempt to convert (i.e. parse) them into values whose types refine the types of the inputs, ruling out invalid states where possible, using the information learned during validation.
- Don’t simply check for invalid states and pass along the inputs regardless (i.e. don’t just validate)
An example illustrates this nicely, and I’ll paraphrase the one from the blog post. If we receive a list of data as an input, and we wish to validate the property that the list is not empty, then we should not simply check for an empty list, and pass along the input unchanged. If the input list is in fact not empty, then we can give its value a stronger type - the type of non-empty lists. We’ve turned our validator into a parser which parses possibly-empty lists into non-empty lists, or fails with a validation error.
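To make the contrast concrete, here is a minimal sketch of the non-empty list example using `Data.List.NonEmpty` from `base`. The function names `validateNonEmpty`, `parseNonEmpty` and `firstElement` are made up for illustration and are not taken from the Hasura codebase.

```haskell
import Data.List.NonEmpty (NonEmpty (..))
import qualified Data.List.NonEmpty as NE

-- Validation only: the check succeeds or fails, but the type of the
-- list does not record what we learned.
validateNonEmpty :: [a] -> Either String ()
validateNonEmpty [] = Left "expected a non-empty list"
validateNonEmpty _  = Right ()

-- Parsing: on success we return a value of the stronger type NonEmpty a.
parseNonEmpty :: [a] -> Either String (NonEmpty a)
parseNonEmpty []       = Left "expected a non-empty list"
parseNonEmpty (x : xs) = Right (x :| xs)

-- Downstream code can now use total functions such as NE.head,
-- with no need to re-check for the empty case.
firstElement :: NonEmpty a -> a
firstElement = NE.head
```

After `validateNonEmpty`, a caller still holds an ordinary list and has to trust that the check was made. After `parseNonEmpty`, the compiler enforces it.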
Simple examples like these are found everywhere in real-world code. If I encounter a date represented as a string, I can parse it into my programming language’s representation of an actual date at the point of validation. We see these simple examples all the time when (for example) parsing JSON structures into typed representations. But what does this have to do with the Hasura 2.0 release?
Server Metadata
In Hasura, we configure our running GraphQL servers with a data structure which we refer to as metadata. The Hasura metadata can be exported and manipulated as a JSON/YAML document, or manipulated indirectly by the running server when we perform operations such as tracking tables. It contains information about all of the data sources that we can talk to, any tables and columns referenced in relational data sources, relationships between those, permissions, functions, computed fields, and so on - everything needed to take your data and turn it into a functioning GraphQL service.
In older versions of the product, we would essentially validate the input metadata and store it directly in the application’s schema cache. The schema cache is our own in-memory copy of the metadata, optimized for the lookups we need to perform while servicing GraphQL operations.
As you can probably guess, in version 2.0, we’ve switched to parsing the input metadata as much as possible, refining it in various ways to capture all of the things we know about it after validation. For example:
- Unchecked table definitions in the input metadata get elaborated in the schema cache by adding relevant column information from the data source itself.
- Unchecked relationships between tables are parsed into object graphs on the Haskell heap itself.
- Unchecked remote schema definitions in the input metadata get elaborated in the schema cache by adding schema information discovered by introspection. Any remote relationships in the metadata can also be cross-referenced with the result of introspection.
- Unchecked REST endpoint definitions in the input metadata get turned into a tree of validated endpoint definitions (ruling out any possible overlapping definitions), each accompanied by a parsed and validated GraphQL AST which will be used to serve any HTTP requests on those endpoints.
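The first bullet above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual schema cache code: the types `TableMetadata` and `TableInfo`, the `Catalog` association list standing in for database introspection, and the functions `resolveTable` and `buildSchemaCache` are all hypothetical names.

```haskell
-- All names below are hypothetical and exist only for illustration.
newtype TableName = TableName String deriving (Eq, Show)
newtype ColumnName = ColumnName String deriving (Eq, Show)

-- What arrives in the input metadata: an unchecked reference to a table.
newtype TableMetadata = TableMetadata TableName deriving (Eq, Show)

-- What we keep after parsing: the table plus the columns we found for it.
data TableInfo = TableInfo
  { tiName    :: TableName
  , tiColumns :: [ColumnName]
  } deriving (Eq, Show)

-- Stand-in for the result of introspecting the data source.
type Catalog = [(TableName, [ColumnName])]

-- Parse one unchecked table definition into an elaborated one,
-- failing if the data source does not contain the table.
resolveTable :: Catalog -> TableMetadata -> Either String TableInfo
resolveTable catalog (TableMetadata name) =
  case lookup name catalog of
    Nothing   -> Left ("table not found in source: " ++ show name)
    Just cols -> Right (TableInfo name cols)

-- Parse every table, stopping at the first failure.
buildSchemaCache :: Catalog -> [TableMetadata] -> Either String [TableInfo]
buildSchemaCache catalog = traverse (resolveTable catalog)
```

Code that later receives a `TableInfo` never needs to ask whether the table exists or what its columns are, because a `TableInfo` can only be produced by a successful parse.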
Dealing with memory issues
The parse-don’t-validate approach allowed us to rule out a lot of error cases in our data, because after we have done our initial checks on the input data, we can use the Haskell type system to represent the condition on the data that we have checked. However, the approach came with a cost - after implementing and merging the PDV changes, we found that the application’s memory usage had increased on certain workloads. We weren’t able to detect any regression during development because the new system was functionally equivalent to the original (not counting memory usage), and we didn’t have adequate performance benchmarks in our CI system (lesson now learned - we will be integrating a regression testing benchmark suite into our build system soon).
This regression was a real concern because we had already merged the functionality to the main branch, and we had merged several other changes on top of the PDV changes before we noticed the regression. For a while, we maintained two parallel branches while we worked to identify the source of the memory issues, but this forced us to backport many fixes to the release branches, slowing down the development process.
Fortunately, we were able to identify the issue, collaborating with the excellent GHC / Haskell folks at Well-Typed: the brand new ghc-debug tool lets us inspect the Haskell heap at runtime and look for unexpected patterns of behavior. In this way, we can identify conditions such as:
- References which are escaping their intended scope
- Lazily-evaluated values which are not being forced in a timely manner (i.e. space leaks)
- Values which are being forced unnecessarily
- Values on the heap which are being duplicated instead of shared
In our case, we were able to notice two of the conditions above: a reference to a large value was leaking, causing that value to be retained on the heap unnecessarily, and we were not taking advantage of sharing on the heap. Using ghc-debug, Well-Typed helped us to identify and fix these two issues.
Even so, the baseline memory consumption had gone up compared to the v1.3.3 release. Looking at the profile, we ruled out an obvious memory leak, and instead decided that this was simply a price of the new abstractions we are using. As it turned out, this was not the end of the story, but we were happy enough with the application’s memory patterns for a broad category of use cases.
Well-Typed has written about ghc-debug in the context of Hasura.
Supporting Multiple Sources
The second large change included in the 2.0 release is the support for multiple database sources, and for different kinds of sources. Specifically, we now support zero or many Postgres databases, and zero or many SQL Server databases, and we’re planning support for several other popular relational databases. In the longer term, we’ll be looking at supporting non-relational sources as well.
From the user perspective, this dramatically simplifies the setup cost for a new Hasura Cloud deployment. There is no longer a need to have a database ready, or to create a database in order to get started. Simply create a fresh Hasura Cloud instance and open it - if you want to test out Postgres connections, you can now do so entirely from within the console, but you could also get started by connecting various actions and remote schemas instead, with no database sources whatsoever.
Multiple Postgres
On the engineering side, the first step was to support multiple Postgres sources, and this is not as simple as it might sound. We had to deal with things like namespace collisions across sources, connection pools whose lifetimes were no longer tied to the lifetime of the running server, several fundamental changes to the user interface, and changes to basic APIs which had assumed the existence of a single source database.
In order to support the case of zero Postgres databases, we also had to decouple the storage of Hasura metadata from the storage in the source database itself. Again, this was no small task, but we are now able to store our metadata in a completely separate database from your own data (via a new configuration option). In the case of Hasura Cloud deployments, we will host all of the metadata on our own servers, so it is no longer a user concern at all. In addition, we’ve taken the opportunity to improve several things about metadata storage, including adding optimistic locking to the data involved (avoiding race conditions during multiple updates in the UI), and simplifying the approach to synchronized schema updates across Hasura nodes in a high-availability setting.
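The optimistic locking idea mentioned above can be sketched as a pure function. This is only an illustration of the general technique, assuming a version counter stored alongside the data. The names `ResourceVersion`, `Stored`, `UpdateError` and `tryUpdate` are hypothetical, and in a real deployment the compare-and-bump step would have to be performed atomically by the metadata database.

```haskell
-- Hypothetical names, for illustration only.
newtype ResourceVersion = ResourceVersion Int deriving (Eq, Show)

-- A stored value together with the version at which it was written.
data Stored a = Stored
  { storedVersion :: ResourceVersion
  , storedValue   :: a
  } deriving (Eq, Show)

-- Raised when the writer's view of the data is out of date.
-- Fields: the version the writer expected, then the version actually stored.
data UpdateError = VersionConflict ResourceVersion ResourceVersion
  deriving (Eq, Show)

-- Apply an update only if the caller read the version that is still current.
-- On success the version is bumped, so a second writer holding the old
-- version is rejected instead of silently overwriting the first update.
tryUpdate :: ResourceVersion -> a -> Stored a -> Either UpdateError (Stored a)
tryUpdate expected newValue (Stored current _)
  | expected == current = Right (Stored (bump current) newValue)
  | otherwise           = Left (VersionConflict expected current)
  where
    bump (ResourceVersion n) = ResourceVersion (n + 1)
```

If two console sessions both read version 1 and both try to write, the first write succeeds and moves the store to version 2, and the second receives a `VersionConflict` and can reload before retrying.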
Different databases
The second engineering step was to add support for multiple types of data source, including support for SQL Server. Again, this was a large refactoring effort, but thankfully we have been able to refactor with confidence due to Haskell’s strong type system and our extensive test suite.
Almost every aspect of the operation pipeline from parsing to type-checking to execution has been generalized in 2.0. It is not simply a case of generating the same SQL and sending it to a different type of server (wouldn’t that be nice!). Instead, we need to consider several issues:
- It may not be possible to push down N+1 queries and other sorts of operations to the database: in Postgres, we did this using jsonb aggregations, but we need to take a different approach with SQL Server and other sources
- Some Postgres data types are simply not available in other databases: json(b), ltree, geometry and geography, and so on
- Different databases may have subtly different behavior when it comes to things like table and column naming (case sensitivity, for example)
- We may have to turn off features entirely for different data sources, temporarily at least.
In order to solve these problems, we introduced several new type classes (similar to interfaces in OO languages, if you are not familiar) into our code, which defined and abstracted over the various backend operations we need to support:
- The Backend type class, which is the root of the new hierarchy of these abstractions, defining the data types for various leaf nodes and relationships in our AST. For example, the type of connection information is defined here, but also types such as the type of table names for a data source (you may think this should always be a string, but it may be a case-insensitive string, or even an integer if we consider supporting Redis).
- The BackendSchema type class, which is responsible for defining any GraphQL root fields exposed by a data source
- The BackendExecute type class, which is responsible for constructing execution plans, and BackendTransport which is responsible for executing them.
The idea is this: if you want to implement a new backend for Hasura, simply implement these new type classes. We will be trying to improve the development story in this area, making it as painless as possible to contribute new backends in this way.
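As a rough sketch of the shape of such a hierarchy, the example below uses an associated type family so that each backend chooses its own table name type. It is heavily simplified and is not the real class definition: the methods `renderTableName` and `rootFieldNames`, the `NumberedStore` backend and the generated field names are assumptions made for illustration.

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)
import Data.Proxy (Proxy (..))

-- Tag types identifying each backend. NumberedStore is a hypothetical
-- backend whose "tables" are identified by integers.
data Postgres
data NumberedStore

-- Root of the hierarchy: each backend picks its own leaf types.
class Backend b where
  type TableName b :: Type
  renderTableName :: Proxy b -> TableName b -> String

instance Backend Postgres where
  type TableName Postgres = String
  renderTableName _ name = name

instance Backend NumberedStore where
  type TableName NumberedStore = Int
  renderTableName _ n = "db" ++ show n

-- A second class layered on top, deciding which root fields a table exposes.
class Backend b => BackendSchema b where
  rootFieldNames :: Proxy b -> TableName b -> [String]

instance BackendSchema Postgres where
  rootFieldNames p t =
    [renderTableName p t, renderTableName p t ++ "_aggregate"]

-- This illustrative backend has aggregations turned off.
instance BackendSchema NumberedStore where
  rootFieldNames p t = [renderTableName p t]
```

Shared code can be written once against `Backend b` and `BackendSchema b`, while each instance supplies the backend-specific details.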
Aside: the basic structure of the syntax trees used for Postgres and MSSQL is very similar, with differences only at the leaves of the tree, for example, to turn off features which do not exist or are not supported in SQL Server. We have implemented the ideas from the Trees that Grow paper (also used in GHC to progressively turn off AST features during the various phases of the compiler pipeline) in order to get maximal code reuse from the commonalities in these syntax trees.
Designing and implementing these type classes has proven to be straightforward enough, but we have gone back and forth deciding how they should be incorporated into the broader product design. The basic problem is that we have implemented various backend types such as Postgres and MSSQL (on which the type class implementations for the classes above will be pinned), but at the outermost layer of the application, we only want to present an API which talks abstractly about a collection of some database sources, not about any particular implementation. That is, in our source list, we might want Postgres and SQL Server implementations side by side in the same list - and mixing values of different types in a single list is not something that Haskell’s type system is particularly happy about.
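A minimal sketch of the Trees that Grow technique follows. The expression type, the `JsonPath` node and the rendering functions are invented for illustration and do not reflect the real Hasura AST. The idea is that a type family chooses, per backend, what extra evidence a constructor requires. Choosing the uninhabited type `Void` makes that constructor impossible to build for that backend.

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)
import Data.Void (Void, absurd)

-- Backend tags.
data Postgres
data MSSQL

-- Extension point: the evidence needed to build a JsonPath node.
type family XJsonPath b :: Type
type instance XJsonPath Postgres = ()   -- feature available
type instance XJsonPath MSSQL    = Void -- feature turned off (illustrative)

-- One shared syntax tree, parameterised by backend.
data Expr b
  = Column String
  | Literal Int
  | JsonPath (XJsonPath b) (Expr b) String

renderPg :: Expr Postgres -> String
renderPg (Column c)          = c
renderPg (Literal n)         = show n
renderPg (JsonPath () e key) = renderPg e ++ "->'" ++ key ++ "'"

renderMssql :: Expr MSSQL -> String
renderMssql (Column c)                = c
renderMssql (Literal n)               = show n
-- No value of type Void exists, so this branch can never run.
renderMssql (JsonPath impossible _ _) = absurd impossible
```

Both renderers share the `Column` and `Literal` cases structurally, and the type checker guarantees that no `Expr MSSQL` ever contains a `JsonPath` node.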
Fortunately, Haskell allows us to unify the various backend implementations in a single list using a feature called existential types, and this is what we have decided to use to solve the problem. To the extent that type classes are like interfaces in OO languages, this approach is similar to using bounded quantification: we erase enough type information to store values with different underlying implementations in the same list, but keep their specific implementations available in the form of their type class instances. Just like bounded quantification, the downside is that we can no longer take advantage of features specific to (let’s say) the Postgres implementation, but only access the sources via their type class instances.
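The existential wrapper can be sketched as follows, using a deliberately simplified, value-level `Backend` class with a single method. The `AnySource` wrapper, the record types and the connection strings are hypothetical and serve only to show the technique.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- A simplified stand-in for the real backend classes.
class Backend b where
  backendName :: b -> String

-- Two concrete source types with different representations.
newtype PostgresSource = PostgresSource { pgUrl :: String }
newtype MSSQLSource = MSSQLSource { mssqlConnString :: String }

instance Backend PostgresSource where
  backendName _ = "postgres"

instance Backend MSSQLSource where
  backendName _ = "mssql"

-- The existential wrapper: the concrete type b is hidden, and only its
-- Backend instance travels along with the value.
data AnySource = forall b. Backend b => AnySource b

-- Sources of different underlying types can now share one list.
sources :: [AnySource]
sources =
  [ AnySource (PostgresSource "postgres://example/db")
  , AnySource (MSSQLSource "Server=example;Database=db")
  ]

-- After unwrapping, only the class methods are available.
sourceName :: AnySource -> String
sourceName (AnySource s) = backendName s
```

Inside `sourceName`, the compiler knows only that `s` has some type with a `Backend` instance, so calling `pgUrl s` would be rejected. This is the loss of backend-specific features described above.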
Memory issues, again
As we were nearing the end of the development cycle for 2.0, we ran into another unfortunate memory issue. We had already eliminated the memory leaks introduced by the PDV refactoring, but something strange was now showing up with mutation-heavy workloads.
During workloads involving long bursts of mutations (inserts, updates, etc.), we were seeing expected increases in memory consumption. However, after the burst completed, we expected the garbage collector to do its job to return the memory consumption back to the baseline amount (i.e. the amount needed to store the schema cache in memory), but this was not happening. Instead, the final memory consumption was quite a lot higher than expected. Obviously, this is not good for either customer or cloud deployments.
Some initial investigation and graph plotting revealed something going on in the runtime system: the memory usage as reported by the operating system was consistently higher than the number reported by the GHC runtime, even after the burst of mutations ended. It seemed as though memory was not being returned promptly to the operating system by the runtime system.
Again, with the help of the GHC experts at Well-Typed, we were able to satisfactorily resolve this. To some extent, it is expected that GHC will try to hold onto allocated memory in case it needs it in future, but this is obviously not good when we expect bursty workloads. Therefore, in a future version of GHC, there will be additional flags available to control the rate of return of memory to the OS.
In addition, it turned out that there was a bug in the GHC runtime system, which has now been addressed: pinned memory was being allocated in a suboptimal way by the runtime.
Now we see the memory consumption return to the expected baseline after bursts of mutations, and a generally healthy memory profile during similar workloads.
Conclusion
2.0 was a large milestone release, including many large features and refactorings, and we saw all of the trade-offs of using a language like Haskell for our development. On the one hand, we can achieve a very high rate of development on complex features, and we are able to refactor confidently, thanks to the strong guarantees of the Haskell type system. This is the benefit of Haskell that you are most likely to hear when you read blog posts like this one.
However, using Haskell certainly comes with a large cost, and it’s necessary for companies using languages like Haskell to really invest in the ecosystem and community, both in terms of time and money. We’ve been working with Well-Typed to support the development of tools we need in order to make Haskell easier to work with for the development of an application this size. At the same time, we’re able to make these tools available to the broader Haskell community, increasing the quality of Haskell tooling for everyone. As we use Haskell to build the next generations of Hasura features, we will continue to do our best to contribute to these improvements to Haskell in an open way.
On the feature side, we’ve now set ourselves up for the development of many new features: we’ve tried to commoditize the development of new database backends, and we are also well set up to support next-level features such as generalized joins and generalized permissions. This will allow us to stitch together all of your different data sources into a single, unified graph with all of the additional Hasura features that we’ve introduced so far.