Transforming data management in biopharma with Hasura
A global biopharmaceutical leader recently embarked on a transformative journey to enhance data management and access capabilities within its research division. In the modern pharmaceutical landscape, data serves as the heartbeat, driving everything from research to commercialization. The company recognized that a robust data platform is crucial to its success.
The director of data platform at the company said, "Our goal is to ensure that our machine learning (ML) engineers and scientists have frictionless access to the data they need. When they complete their analysis, there is a repository where they can deposit, catalog, iterate, and version the data. We want to create that experience."
In mid-2023, the research engineering team at the company set out to build that experience for the researchers and ML engineers in their computational chemistry subdivision.
Poor data discovery and search hinder innovation
Previously, the computational chemistry research group used a niche data-sharing solution called PIMS for uploading and sharing data frames. While PIMS is widely used within the R community, it wasn't designed to be the enterprise data catalog and management tool the company needed. It lacked search and discovery capabilities and was tightly coupled with the R language framework, restricting its usability.
To address this, the research engineering team set out to build a data warehouse with a robust custom data catalog service that offered excellent searchability and governance. A principal informatics data engineer led this initiative.
The engineer explained their goal: "We wanted to let everyone share whatever data they wanted – run our own pipelines and make it super easy for other users to find and download that data."
Why build a custom metadata service
The data platform team selected AWS S3 as their data warehouse due to its suitability for the unstructured and semi-structured data generated in computational chemistry. They use PostgreSQL to store all metadata and schema information about the data assets in their warehouse, with Hasura powering the metadata service on top of PostgreSQL.
Building a custom metadata service with Hasura and PostgreSQL gave the team the self-service discovery and access they wanted, without sacrificing governance. Hasura exposes a robust API over the metadata, facilitates validation (a crucial aspect of any self-service portal), keeps the registry synchronized with the lakehouse, and enforces access control rules. This would not have been possible with an off-the-shelf data catalog solution.
1000x better than a typical data catalog
Like most Fortune 500 enterprises, this biopharmaceutical company has an expansive software portfolio that includes various data catalog solutions, such as Axon EDC, DataZone, and Collibra. However, their implementation with Hasura addresses a unique use case beyond traditional cataloging.
The director of the data platform team explains that Hasura far surpasses traditional catalogs by providing critical integration with data warehouses – an essential capability for effective governance and self-serve access. He added, “For this use case, Hasura is a thousand times more productive than any other tool.”
The Hasura API powers the data catalog UI, allowing scientists to easily browse, search, and discover the data they need. They use Hasura Actions to generate signed URLs to download data sets from the warehouse. This service is also available via an SDK, enabling their more programming-oriented stakeholders, such as ML engineers, to download and upload data to the warehouse.
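As a rough illustration of the signed-URL flow described above, the sketch below shows what a handler behind a Hasura Action might look like. It is not the company's implementation. The bucket name, the `object_key` input field, and the `url`/`expires_in` output fields are all hypothetical. The handler reads the `input` object that Hasura includes in the Action request body and returns either a result or an error object carrying a `message` field. The presigning function is injected so the logic can be exercised without AWS credentials. The default uses boto3's `generate_presigned_url`.

```python
from typing import Callable, Tuple

# Hypothetical values: the real bucket and expiry policy are not in the source.
BUCKET = "example-research-lake"
URL_TTL_SECONDS = 900


def boto3_presign(bucket: str, key: str, ttl: int) -> str:
    """Presign an S3 GET with boto3 (imported lazily; needs AWS credentials)."""
    import boto3

    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=ttl,
    )


def handle_download_action(
    payload: dict,
    presign: Callable[[str, str, int], str] = boto3_presign,
) -> Tuple[int, dict]:
    """Return (HTTP status, JSON body) for a Hasura Action request payload."""
    key = payload.get("input", {}).get("object_key")  # hypothetical input field
    if not isinstance(key, str) or not key or key.startswith("/"):
        # Action handlers signal failure with a 4xx status and a "message".
        return 400, {"message": "object_key must be a non-empty relative key"}
    url = presign(BUCKET, key, URL_TTL_SECONDS)
    return 200, {"url": url, "expires_in": URL_TTL_SECONDS}
```

In a real deployment, a function like this would sit behind an HTTP endpoint registered as the Action's handler. Hasura would check permissions on the Action before calling it.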
Accelerating R&D with self-service access
With this new system, researchers can now register and upload their data assets, such as prediction datasets from ML jobs, reducing dependency on a central IT team. The Hasura eventing system orchestrates automated validation between the metadata and the warehouse to ensure data is correctly defined and uploaded, enabling seamless end-to-end self-service on data uploads.
The data platform director added, “This self-service nature is really helping the team go faster, without being bottlenecked by a central team.”
Hasura's eventing system has been highly beneficial for the company. It helps keep the metadata store synchronized with the data lakehouse by automatically taking corresponding actions when users add or delete objects in the metadata catalog. This automation eliminates the need for custom solutions like SQS, reducing operational complexity.
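The synchronization described above can be sketched as a webhook that reacts to Hasura event trigger payloads. This is a minimal illustration under stated assumptions, not the team's actual service. The `object_key` column and the `Warehouse` interface are hypothetical, and an in-memory stand-in replaces S3. The payload shape it relies on (`event.op` plus `event.data.old` and `event.data.new`) is what Hasura sends for table events.

```python
from typing import Iterable, Protocol


class Warehouse(Protocol):
    """Hypothetical storage interface; a real one would wrap S3 calls."""

    def exists(self, key: str) -> bool: ...
    def delete(self, key: str) -> None: ...


class InMemoryWarehouse:
    """Stand-in for the lakehouse so the sketch runs without AWS."""

    def __init__(self, keys: Iterable[str]) -> None:
        self._keys = set(keys)

    def exists(self, key: str) -> bool:
        return key in self._keys

    def delete(self, key: str) -> None:
        self._keys.discard(key)


def handle_catalog_event(payload: dict, warehouse: Warehouse) -> str:
    """Keep the warehouse in step with inserts and deletes in the catalog table."""
    event = payload["event"]
    op, data = event["op"], event["data"]
    if op == "INSERT":
        # A newly registered asset should already be present in the warehouse.
        key = data["new"]["object_key"]  # hypothetical column name
        return "validated" if warehouse.exists(key) else "missing_object"
    if op == "DELETE":
        # Removing a catalog row removes the matching object.
        key = data["old"]["object_key"]
        warehouse.delete(key)
        return "deleted"
    return "ignored"
```

A production handler would also need to decide what happens on `missing_object`, for example flagging the catalog row for review.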
Shaving off six months of engineering effort
Building the metadata service API with Hasura significantly cut the team's time to market for this critical initiative. Hasura provides feature-rich APIs with fast searchability across many foreign-keyed relationships in their PostgreSQL database, eliminating the need to write custom data loaders. It also removed the need to manually handle the N+1 query problem, which arises when a hand-rolled API fetches related rows with one extra query per parent row instead of a single join.
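To make the N+1 problem concrete, here is a small self-contained sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the `datasets` and `owners` tables are hypothetical. The naive version issues one query for the list plus one per row. The joined version gets the same result in a single query, which is the kind of work a GraphQL-to-SQL engine such as Hasura takes off the API author's hands.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a tiny two-table catalog with a foreign key (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE owners (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE datasets (
            id INTEGER PRIMARY KEY,
            name TEXT,
            owner_id INTEGER REFERENCES owners(id)
        );
        INSERT INTO owners VALUES (1, 'ana'), (2, 'raj');
        INSERT INTO datasets VALUES
            (1, 'binding_preds', 1), (2, 'solubility', 2), (3, 'docking_runs', 1);
        """
    )
    return db


def fetch_naive(db: sqlite3.Connection):
    """1 query for the datasets, then 1 more per dataset for its owner."""
    queries = 1
    rows = db.execute("SELECT name, owner_id FROM datasets ORDER BY id").fetchall()
    result = []
    for name, owner_id in rows:
        owner = db.execute(
            "SELECT name FROM owners WHERE id = ?", (owner_id,)
        ).fetchone()[0]
        queries += 1
        result.append((name, owner))
    return result, queries


def fetch_joined(db: sqlite3.Connection):
    """The same data in a single query using a join."""
    rows = db.execute(
        "SELECT d.name, o.name FROM datasets d "
        "JOIN owners o ON o.id = d.owner_id ORDER BY d.id"
    ).fetchall()
    return rows, 1
```

With three datasets the naive path runs four queries and the joined path runs one. The gap grows linearly with the number of rows returned.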
The principal engineer estimated, “Just for our use case, it would probably take a good six months to build out all the features that Hasura comes with out of the box.”
A notable side benefit is that computational scientists, who had never worked on server-side applications before, can now contribute to API building. The principal engineer comments that Hasura's ability to generate an API has fostered collaboration in the research and engineering team: “Hasura enables better collaboration because the bar to contribute is just design your table rather than design a whole API.”
Powerful authorization simplifies access control
Hasura's extensible authorization system enabled the company to serve multiple stakeholder teams, such as the large-molecule and small-molecule teams, from the same data platform. They use an authorization endpoint in Hasura to query their enterprise LDAP server and verify user permissions, which Hasura then uses to enforce access privileges against the catalog and the warehouse.
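One way such an authorization endpoint could work is through Hasura's webhook authentication mode. In that mode the webhook answers with HTTP 200 and a set of `X-Hasura-*` session variables to allow a request, or with 401 to deny it. The sketch below assumes that setup, which the source does not confirm. The group names, role names, and group-to-role mapping are hypothetical, and the LDAP lookup is injected as a plain function so the example runs without a directory server. Token verification is out of scope: `user_id` is assumed to be already authenticated.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical mapping from LDAP groups to Hasura roles, highest priority first.
GROUP_TO_ROLE = [
    ("catalog-admins", "catalog_admin"),
    ("large-molecule", "large_molecule_researcher"),
    ("small-molecule", "small_molecule_researcher"),
]


def authorize(
    user_id: Optional[str],
    lookup_groups: Callable[[str], Iterable[str]],
) -> Tuple[int, dict]:
    """Return (HTTP status, JSON body) for a Hasura auth webhook response."""
    if not user_id:
        return 401, {}
    groups = set(lookup_groups(user_id))
    for group, role in GROUP_TO_ROLE:
        if group in groups:
            # Session variable values are returned as strings.
            return 200, {"X-Hasura-Role": role, "X-Hasura-User-Id": user_id}
    # No recognized group: deny rather than fall back to a default role.
    return 401, {}
```

Row-level permission rules defined in Hasura can then reference the returned role and user id, so each team only sees the catalog entries it is entitled to.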
Hasura also acts as a proxy, performing a behind-the-scenes role exchange with AWS to control which datasets the user can access from the S3 lakehouse. The data platform leader is a fan of the operational efficiency this brings: “Doing this in AWS directly would take a lot of manual work. Hasura simplifies that process.”
Impact and looking forward
With this first use case successfully in production, the team is excited about the prospect of extending Hasura to streamline the data consumption patterns in other areas. For example, the director of the data platform team shared that they are exploring using Hasura to provide self-documented API access to their Athena tables, instead of the current JDBC access pattern, which has some limitations.
The team wants to build a data ecosystem that helps their researchers be more effective. That's what drives their work. Simplifying data access is a big part of that story!