Building high-velocity data quality feedback loops: A modern approach to enterprise data health

The hidden cost of poor data quality

Large organizations perennially struggle with data quality issues – a challenge that extends beyond mere inconvenience. Poor data quality results in lost opportunities, reputational damage, increased operational costs, and potential regulatory exposure.

And let's not forget AI and generative AI. For organizations trying to leverage these technologies, access to high-quality data is consistently one of the biggest obstacles.

What if there were a better way?
Imagine a simple, standardized, data-source-agnostic data quality service that could complement existing approaches and dramatically improve data health through rapid feedback loops.

The key to this transformation lies in understanding and optimizing how information flows between data producers and consumers. This brings us to one of the most powerful concepts in systems design: feedback loops.

The power of feedback loops in data quality

In classic systems design, a feedback loop routes a system's outputs back to its inputs. In our case, the loop connects data producers with data consumers: analysts, data scientists, and application owners. When implemented effectively, these loops don't just identify issues. They address root causes, foster collaboration, and drive behavioral change.

The result? What system designers call a "positive reinforcing loop" or "virtuous cycle" of continuous improvement.

Current approaches and their limitations

Organizations typically manage data quality by combining two primary approaches:

  1. Data incident systems where users report issues through workflow tools
  2. Rule-based quality checks implemented by domain teams at the source

While both are essential, they face a fundamental limitation: Data quality at the point of delivery often involves combining (or compositing) data from multiple domains.

Domain data teams focused on their own data sets can never anticipate every downstream use case in which their data might be combined with data from other domains. Each new downstream use case may surface quality requirements that individual domain teams couldn't have imagined.

For example, a sales total might be valid within its domain, but new quality considerations emerge when combined with inventory data for margin calculations. So, domain-based data quality assessment at point-of-origin won’t solve this issue.

In this scenario, the reporting teams or applications compositing this data must own the quality assessment. Typically, these assessments are domain-specific or application-specific, sometimes implemented in code and sometimes relying on user feedback. The resulting issues are generally logged in a data quality incident system, often with some manual curation.

Most organizations have thousands of data incidents tracked in workflow systems, and if your experience is like mine, resolution is not quick. This creates a long, slow feedback loop in which issues are discovered far from their source and with few supporting facts, making root-cause analysis difficult at best and sometimes impossible.

The challenge isn't managing incidents – it's creating rapid feedback loops by identifying these data quality issues earlier, at the point of delivery, and providing structured feedback that accelerates resolution.

The organizational landscape

Before exploring technical solutions, it's crucial to understand how data responsibilities typically flow through a large enterprise. In most organizations, three distinct teams share responsibility for data quality, each with its own focus and challenges:

Domain data owners

  • Create and manage data models and relationships
  • Define entity-level quality rules
  • Ensure data provisioning
  • Maintain technical and universal fit-for-use standards

Federated data team

  • Oversee metadata standards
  • Provide data discovery tools
  • Manage federated data provisioning
  • Enable self-service through platform-agnostic tools

Cross-domain data teams

  • Create derived data products
  • Build reports, applications, and ML models
  • Generate new metrics and aggregates
  • Focus on specific business proposals

The obvious gap sits with the cross-domain data teams. For the most part, they handle the last mile of data, and that data often goes to the Board, the C-Suite, or into a regulatory report.

What resources do these teams have to effectively communicate their issues to the systems of origin?

Building a next-generation quality service

Core requirements

To address the challenges of these teams, particularly cross-domain data users, we need capabilities that bridge gaps while promoting collaboration and standardization.

A modern automated data quality service should deliver the following:

  • Real-time validation during data delivery, i.e., in-line validation
  • Centralized rule standards with local control flexibility
  • Integration with observability tools
  • Standardized metadata-driven rule results
  • Self-service data composition capabilities

If such a system were in place, instead of discovering issues downstream or relying solely on source-system validations, teams could validate data against defined rules at the point of delivery.
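
To make "standardized metadata-driven rule results" concrete, here is a minimal sketch in TypeScript of one possible result shape. The field names and example values are assumptions for illustration only; they are not the format used by the implementation referenced later in this post.

```typescript
// One possible shape for a standardized, metadata-driven rule result produced at the
// point of delivery. Every field name here is an illustrative assumption, not a
// published contract.

type RuleSeverity = "info" | "warning" | "error";

interface RuleResult {
  ruleId: string;                 // stable identifier of the rule definition
  dataset: string;                // logical name of the delivered dataset
  producerDomain: string;         // upstream owner, so feedback can be routed automatically
  severity: RuleSeverity;
  passed: boolean;
  failedRecords: number;          // how many rows violated the rule
  sampleFailures: Record<string, unknown>[]; // a few offending rows to aid triage
  evaluatedAt: string;            // ISO-8601 timestamp of the in-line check
}

// Example payload a downstream team or observability tool could consume without parsing prose.
const example: RuleResult = {
  ruleId: "sales.totalAmount.non-negative",
  dataset: "sales_with_inventory",
  producerDomain: "sales",
  severity: "error",
  passed: false,
  failedRecords: 3,
  sampleFailures: [{ productId: "SKU-1042", totalAmount: -10 }],
  evaluatedAt: "2024-11-21T09:30:00Z",
};

console.log(JSON.stringify(example, null, 2));
```

Because the result is structured rather than free text, it can flow straight into observability tools or an incident workflow without manual curation.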

Technical foundation

This is not technically trivial and requires buy-in across several teams to deliver. But it's doable, and the benefits far outweigh the implementation challenges.

Success depends on positioning the solution correctly to key stakeholders, particularly:

  • Technology leaders who understand the long-term maintenance benefits
  • Business stakeholders who see the impact of data-driven decision-making
  • Data teams who recognize the reduced incident management overhead

Prerequisites
Building an effective data quality feedback system isn't just about adding in-line, point-of-delivery validation rules – it requires the right foundation.

These prerequisites ensure your solution scales across the enterprise while maintaining performance and flexibility:

  • An extensible metadata-driven data access layer
  • Automatable data quality rules
  • A virtualization-like capability (e.g., Trino or Hasura DDN)

To be clear, combining all data into one physical location does not solve this issue. The obvious choice is a virtualization-type approach such as Trino, a distributed SQL query engine, or Hasura Data Delivery Network (DDN), an automated, metadata-driven data access layer.

The big idea with a virtualization-like capability is not to move data but to create and operationalize a semantic layer across all data sources using one of these products. This lets the core data teams focus on building high-quality datasets, gives downstream teams self-service tools to create use-case-specific datasets, and offers an integration point for augmenting data services.
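
As a rough illustration of what self-service composition over that semantic layer can look like, the sketch below queries a GraphQL endpoint of the kind Hasura DDN exposes, pulling sales rows together with a related inventory object defined in metadata. The endpoint URL, model names, and fields are hypothetical assumptions for this example.

```typescript
// A hypothetical self-service query against the semantic layer's GraphQL endpoint.
// The endpoint URL, field names, and relationship are assumptions for illustration;
// real model names come from the metadata your domain teams publish.

const ENDPOINT = "https://your-ddn-endpoint.example.com/graphql";

const marginQuery = `
  query MarginInputs {
    sales(limit: 100) {
      productId
      totalAmount
      inventory {          # cross-domain relationship defined in metadata
        unitCost
        quantityOnHand
      }
    }
  }
`;

async function fetchMarginInputs(): Promise<unknown> {
  const response = await fetch(ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: marginQuery }),
  });
  if (!response.ok) {
    throw new Error(`Semantic layer query failed: ${response.status}`);
  }
  return response.json();
}

fetchMarginInputs().then((data) => console.log(JSON.stringify(data, null, 2)));
```

The point is that the downstream team composes the cross-domain dataset themselves, while the same delivery path gives the platform a natural place to attach in-line validation.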

Implementation example

To demonstrate this approach in practice, I built a working example using three key technologies:

  • Hasura DDN's plugin ecosystem for data access
  • JSON Schema standard for creating machine-readable rule definitions
  • AJV for generating standardized rule results

This combination provides a robust foundation while remaining relatively simple to implement.
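
As a minimal sketch of how the JSON Schema and AJV pieces fit together, the snippet below compiles an invented rule definition and turns AJV's output into machine-readable failure records. The rule content and the sample rows are assumptions for illustration; the actual rule definitions and result format live in the repository linked below.

```typescript
import Ajv, { ErrorObject } from "ajv";

// A machine-readable rule definition expressed as JSON Schema.
// The rule content (required product ID, non-negative sales total) is invented for illustration.
const salesRowRule = {
  type: "object",
  required: ["productId", "totalAmount"],
  properties: {
    productId: { type: "string", minLength: 1 },
    totalAmount: { type: "number", minimum: 0 },
  },
};

const ajv = new Ajv({ allErrors: true }); // report every violation, not just the first
const validate = ajv.compile(salesRowRule);

// Validate each delivered row and collect structured failures instead of free-text descriptions.
function validateRows(rows: unknown[]): { row: number; errors: ErrorObject[] }[] {
  const failures: { row: number; errors: ErrorObject[] }[] = [];
  rows.forEach((row, index) => {
    if (!validate(row)) {
      failures.push({ row: index, errors: [...(validate.errors ?? [])] });
    }
  });
  return failures;
}

// The second row fails both the required-field check and the minimum check.
const failures = validateRows([
  { productId: "SKU-1", totalAmount: 125.5 },
  { totalAmount: -10 },
]);
console.log(JSON.stringify(failures, null, 2));
```

AJV's error objects already carry the path, keyword, and offending value, which is exactly the kind of precise, structured feedback a producer team needs for fast root-cause analysis.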

The complete code implementing the Hasura Plugin Hub and Data Validator plugin is available at https://github.com/hasura/plugin-hub. This open source solution is written in NodeJS/TypeScript but can be adapted to your preferred technology stack since the Hasura DDN plugin framework uses a simple HTTP contract.

Installation instructions are in the README file. You can deploy the entire system as-is or use it as inspiration for your own implementation.
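
If you do adapt the idea to another stack, the HTTP surface can stay very small. The sketch below is a bare-bones Node endpoint that accepts delivered rows and returns a structured result; the path, payload shape, and the trivial inline check are placeholders I have invented for this post, not the actual plugin contract (see the repository README for that).

```typescript
import { createServer } from "node:http";

// A bare-bones in-line validation endpoint. The /validate path and the request/response
// shapes are placeholders invented for this post, not the plugin-hub contract.
const server = createServer((req, res) => {
  if (req.method !== "POST" || req.url !== "/validate") {
    res.writeHead(404).end();
    return;
  }

  let body = "";
  req.on("data", (chunk: Buffer) => {
    body += chunk.toString();
  });
  req.on("end", () => {
    const { rows = [] } = JSON.parse(body || "{}") as { rows?: Array<Record<string, unknown>> };

    // Stand-in for real rule evaluation (see the AJV sketch above): flag negative sales totals.
    const failures = rows
      .map((row, index) => ({ row: index, totalAmount: row.totalAmount }))
      .filter((r) => typeof r.totalAmount === "number" && r.totalAmount < 0);

    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(
      JSON.stringify({
        evaluatedAt: new Date().toISOString(),
        rowCount: rows.length,
        failures,
      })
    );
  });
});

server.listen(8787, () => console.log("Validation sketch listening on :8787"));
```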

Transformative benefits

This approach delivers significant advantages:

Proactive detection

  • Catches issues during data delivery
  • Limits downstream impacts

Enhanced communication

  • Provides machine-readable results
  • Enables automated analysis

Clear accountability

  • Maintains domain team ownership
  • Enables consumer-specific validation

Standardized approach

  • Uses consistent metadata-driven definitions
  • Provides uniform result formatting

Accelerated resolution

  • Delivers precise issue details
  • Eliminates ambiguity in problem reporting
  • Dramatically reduces the time between issue creation and resolution
  • Helps build the “virtuous cycle” of continuous improvement

Getting started

My suggestion is to consider a focused pilot:

  • Choose a high-value cross-domain use case
  • Implement basic validation rules
  • Measure impact and effectiveness
  • Build momentum for expansion

To be clear, this is an automation solution; you still need a process design to create the feedback loop. The automation provides the ingredients for a workable process, but the process itself is up to you.

Following this approach will reduce implementation risk while demonstrating tangible value to stakeholders. It will also create advocates for broader adoption across the enterprise.

Future possibilities

This foundation opens doors to advanced capabilities:

  • Anomaly detection
  • Sophisticated data profiling
  • Workflow system integration
  • Cross-platform quality controls

Because this concept provides a standard, metadata-driven approach, organizations can implement consistent quality controls across diverse data platforms and tools. This standardization extends beyond just the technical implementation – it creates a common language and shared understanding of data quality requirements and issues among all stakeholders.

Most importantly, it enables high-velocity feedback loops that were previously impossible. When data quality issues are caught early, described precisely, and communicated in a standard format, teams move from reactive firefighting to proactive quality management.

Teams can start simple and evolve to more sophisticated approaches while maintaining clear responsibilities and accelerating the feedback cycle.

Moving forward

The path to better data quality isn't just about technology – it's about creating sustainable, efficient processes that connect data producers and consumers. By implementing automated feedback loops at the point of delivery, organizations dramatically reduce the time and effort required to identify and resolve data quality issues.

The approach outlined here provides a framework that is:

  • Technically feasible
  • Organization-friendly
  • Incrementally implementable
  • Demonstrably valuable

Most importantly, it creates the foundation for continuous improvement in data quality across the enterprise.

Ready to learn more? Get your copy of The data doom loop: Why big corporations are failing at data management.

Download now
