The hidden cost of poor data quality
Large organizations perennially struggle with data quality issues – a challenge that extends beyond mere inconvenience. Poor data quality results in lost opportunities, reputational damage, increased operational costs, and potential regulatory exposure.
And let's not forget AI and generative AI: for organizations trying to leverage these technologies, the number one problem is access to high-quality data.
What if there was a better way?
Imagine a simple, standardized, data-source-agnostic data quality service that could complement existing approaches and dramatically improve data health through rapid feedback loops.
The key to this transformation lies in understanding and optimizing how information flows between data producers and consumers. This brings us to one of the most powerful concepts in systems design: feedback loops.
The power of feedback loops in data quality
In classic systems design, feedback loops connect producers and consumers; in our case, data producers and data consumers (analysts, scientists, and application owners). When implemented effectively, these loops don't just identify issues. They address root causes, foster collaboration, and drive behavioral change.
The result? What system designers call a "positive reinforcing loop" or "virtuous cycle" of continuous improvement.
Current approaches and their limitations
Organizations typically manage data quality by combining two primary approaches:
- Data incident systems where users report issues through workflow tools
- Rule-based quality checks implemented by domain teams at the source
While both are essential, they face a fundamental limitation: Data quality at the point of delivery often involves combining (or compositing) data from multiple domains.
Domain data teams focused on their specific data sets will never anticipate all downstream use cases where their data might be combined with data from other domains. Each new downstream use case may surface quality requirements that individual domain teams couldn't have imagined.
For example, a sales total might be valid within its domain, but new quality considerations emerge when combined with inventory data for margin calculations. So, domain-based data quality assessment at point-of-origin won’t solve this issue.
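To make the point concrete, here is a small TypeScript sketch. The types and field names are invented for illustration: each record satisfies its own domain's rules, yet the margin calculation that composes them surfaces a quality requirement neither domain team would have defined on its own.

```typescript
// Hypothetical illustration: both records are "valid" within their own domains,
// but composing them for a margin calculation surfaces a new quality requirement.
interface SalesRecord {
  sku: string;
  totalRevenue: number; // valid per the sales domain's rules
}

interface InventoryRecord {
  sku: string;
  unitCost: number | null; // the inventory domain allows null while costing is pending
  unitsSold: number;
}

function grossMargin(sale: SalesRecord, inventory: InventoryRecord): number {
  if (inventory.unitCost === null) {
    // Neither source system flags this: the rule only exists at the point of composition.
    throw new Error(`Margin for ${sale.sku} is incomputable: unit cost is missing`);
  }
  return sale.totalRevenue - inventory.unitCost * inventory.unitsSold;
}
```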
In this scenario, the reporting teams or applications compositing this data must own the quality assessment. Typically, these assessments are handled through domain-specific or application-specific approaches, sometimes in code and sometimes by relying on user feedback. The resulting issues are generally logged in a data quality incident system, often with some manual curation.
Most organizations have thousands of data incidents tracked in workflow systems, and if your experience is like mine, resolution is not quick. This creates a long, slow feedback loop where issues are discovered far from their source, with minimal facts, making root-cause analysis at best difficult or sometimes impossible.
The challenge isn't managing incidents – it's creating rapid feedback loops by identifying these data quality issues earlier, at the point of delivery, and providing structured feedback that accelerates resolution.
The organizational landscape
Before exploring technical solutions, it's crucial to understand how data responsibilities typically flow through a large enterprise. In most organizations, three distinct teams share responsibility for data quality, each with its own focus and challenges:
Domain data owners
- Create and manage data models and relationships
- Define entity-level quality rules
- Ensure data provisioning
- Maintain technical and universal fit-for-use standards
Federated data team
- Oversee metadata standards
- Provide data discovery tools
- Manage federated data provisioning
- Enable self-service through platform-agnostic tools
Cross-domain data teams
- Create derived data products
- Build reports, applications, and ML models
- Generate new metrics and aggregates
- Focus on specific business proposals
The obvious problem sits with the cross-domain data teams. For the most part, they handle the last mile of data: the data that often goes to the Board, the C-Suite, or into a regulatory report.
What resources do these teams have to effectively communicate their issues to the systems of origin?
Building a next-generation quality service
Core requirements
To address the challenges of these teams, particularly cross-domain data users, we need capabilities that bridge gaps while promoting collaboration and standardization.
A modern automated data quality service should deliver the following:
- Real-time validation during data delivery, i.e., in-line validation
- Centralized rule standards with local control flexibility
- Integration with observability tools
- Standardized metadata-driven rule results
- Self-service data composition capabilities
If such a system were in place, instead of discovering issues downstream or relying solely on source-system validations, teams could validate data against defined rules at the point of delivery.
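As an illustration of what "standardized, metadata-driven rule results" could look like, here is one possible shape expressed in TypeScript. The field names and structure are assumptions made for this sketch, not a published contract.

```typescript
// One possible shape for a standardized, machine-readable rule result.
// Field names are illustrative assumptions, not a defined specification.
interface RuleResult {
  ruleId: string;          // identifier of the metadata-defined rule
  dataset: string;         // the composite dataset being delivered
  field?: string;          // offending field, when applicable
  severity: "info" | "warning" | "error";
  message: string;         // human-readable description
  evaluatedAt: string;     // ISO 8601 timestamp
}

const example: RuleResult = {
  ruleId: "margin.unit_cost.not_null",
  dataset: "sales_with_inventory",
  field: "unitCost",
  severity: "error",
  message: "unitCost is null for 42 of 10,000 delivered rows",
  evaluatedAt: new Date().toISOString(),
};
```

Because results in a shape like this are machine-readable, they can be routed automatically to the owning domain team rather than waiting for manual curation in an incident queue.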
Technical foundation
This is not technically trivial and requires buy-in across several teams to deliver. But it's doable, and the benefits far outweigh the implementation challenges.
Success depends on positioning the solution correctly to key stakeholders, particularly:
- Technology leaders who understand the long-term maintenance benefits
- Business stakeholders who see the impact on data-driven decision-making
- Data teams who recognize the reduced incident management overhead
Prerequisites
Building an effective data quality feedback system isn't just about adding in-line, point-of-delivery validation rules – it requires the right foundation.
These prerequisites ensure your solution scales across the enterprise while maintaining performance and flexibility:
- An extensible metadata-driven data access layer
- Automatable data quality rules
- A virtualization-like capability (e.g., Trino or Hasura DDN)
Importantly, simply combining all data into one location does not solve this issue. The obvious choice is a virtualization-type approach such as Trino, a distributed SQL query engine, or Hasura Data Delivery Network (DDN), an automated, metadata-driven data access layer.
The big idea with a virtualization-like capability is not to move data but to create and operationalize a semantic layer across all data sources using one of these products. This lets core data teams focus on building high-quality datasets, gives downstream teams self-service tools to create use-case-specific datasets, and offers an integration point for augmenting data services.
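To give a feel for consuming such a semantic layer, here is a minimal TypeScript sketch that queries a hypothetical GraphQL endpoint of the kind Hasura DDN exposes. The endpoint URL, field names, and the cross-domain relationship are assumptions for illustration only.

```typescript
// Hypothetical query against a semantic layer exposed as GraphQL.
// The URL, fields, and "inventory" relationship are invented for this sketch.
const query = `
  query MarginInputs {
    sales(limit: 100) {
      sku
      totalRevenue
      inventory {        # cross-domain relationship resolved by the semantic layer
        unitCost
        unitsSold
      }
    }
  }
`;

async function fetchMarginInputs(): Promise<unknown> {
  const response = await fetch("https://my-ddn-endpoint.example.com/graphql", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ query }),
  });
  if (!response.ok) {
    throw new Error(`Semantic layer returned ${response.status}`);
  }
  return response.json();
}
```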
Implementation example
To demonstrate this approach in practice, I built a working example using three key technologies:
- Hasura DDN's plugin ecosystem for data access
- JSON Schema standard for creating machine-readable rule definitions
- AJV for generating standardized rule results
This combination provides a robust foundation while remaining relatively simple to implement.
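For a sense of how the JSON Schema and AJV pieces fit together, here is a self-contained sketch. The rule itself is invented for illustration; the rules shipped with the plugin-hub repository may be structured differently.

```typescript
import Ajv from "ajv";

// An illustrative data quality rule expressed as JSON Schema.
const marginInputRule = {
  type: "object",
  properties: {
    sku: { type: "string", minLength: 1 },
    totalRevenue: { type: "number", minimum: 0 },
    unitCost: { type: "number", exclusiveMinimum: 0 },
  },
  required: ["sku", "totalRevenue", "unitCost"],
};

const ajv = new Ajv({ allErrors: true });
const validateRow = ajv.compile(marginInputRule);

// A delivered row that is "valid" in the sales domain but fails the composite rule.
const row = { sku: "A-100", totalRevenue: 1250.0, unitCost: null };

if (!validateRow(row)) {
  // AJV emits structured, machine-readable errors that can be mapped into a
  // standardized rule-result payload and returned to the data producer.
  console.log(JSON.stringify(validateRow.errors, null, 2));
}
```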
The complete code implementing the Hasura Plugin Hub and Data Validator plugin is available at https://github.com/hasura/plugin-hub. This open-source solution is written in Node.js/TypeScript but can be adapted to your preferred technology stack, since the Hasura DDN plugin framework uses a simple HTTP contract.
Installation instructions are in the README file. You can deploy the entire system as-is or use it as inspiration for your own implementation.
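If you prefer to adapt the idea to your own stack, the sketch below shows the general shape of an HTTP validation hook using Express. The request and response shapes are assumptions for illustration only; the actual plugin contract is defined by the Hasura DDN plugin framework and documented in the repository.

```typescript
import express from "express";

// A minimal sketch of an in-line validation hook reachable over HTTP.
// The request/response shapes here are hypothetical, not the real plugin contract.
const app = express();
app.use(express.json());

app.post("/validate", (req, res) => {
  const rows: Array<Record<string, unknown>> = req.body?.rows ?? [];

  // Apply a simple delivery-time rule: flag rows with a missing unit cost.
  const failures = rows
    .map((row, index) => ({ row, index }))
    .filter(({ row }) => row.unitCost === null || row.unitCost === undefined)
    .map(({ index }) => ({
      ruleId: "margin.unit_cost.not_null",
      severity: "error",
      message: `Row ${index}: unitCost is missing`,
    }));

  res.json({ passed: failures.length === 0, results: failures });
});

app.listen(8787, () => console.log("validation sketch listening on :8787"));
```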
Transformative benefits
This approach delivers significant advantages:
Proactive detection
- Catches issues during data delivery
- Limits downstream impacts
Enhanced communication
- Provides machine-readable results
- Enables automated analysis
Clear accountability
- Maintains domain team ownership
- Enables consumer-specific validation
Standardized approach
- Uses consistent metadata-driven definitions
- Provides uniform result formatting
Accelerated resolution
- Delivers precise issue details
- Eliminates ambiguity in problem reporting
- Dramatically reduces the time between issue creation and resolution
- Helps build the “virtuous cycle” of continuous improvement
Getting started
My suggestion is to consider a focused pilot:
- Choose a high-value cross-domain use case
- Implement basic validation rules
- Measure impact and effectiveness
- Build momentum for expansion
To be clear, this is an automation solution: you still need a process design to create the feedback loop. The automation provides the ingredients for a workable process, but the process itself is up to you.
Following this approach will reduce implementation risk while demonstrating tangible value to stakeholders. It will also create advocates for broader adoption across the enterprise.
Future possibilities
This foundation opens doors to advanced capabilities:
- Anomaly detection
- Sophisticated data profiling
- Workflow system integration
- Cross-platform quality controls
Because this concept provides a standard, metadata-driven approach, organizations can implement consistent quality controls across diverse data platforms and tools. This standardization extends beyond just the technical implementation – it creates a common language and shared understanding of data quality requirements and issues among all stakeholders.
Most importantly, it enables high-velocity feedback loops that were previously impossible. When data quality issues are caught early, described precisely, and communicated in a standard format, teams move from reactive firefighting to proactive quality management.
Teams can start simple and evolve to more sophisticated approaches while maintaining clear responsibilities and accelerating the feedback cycle.
Moving forward
The path to better data quality isn't just about technology – it's about creating sustainable, efficient processes that connect data producers and consumers. By implementing automated feedback loops at the point of delivery, organizations dramatically reduce the time and effort required to identify and resolve data quality issues.
The approach outlined here provides a framework that is:
- Technically feasible
- Organization-friendly
- Incrementally implementable
- Demonstrably valuable
Most importantly, it creates the foundation for continuous improvement in data quality across the enterprise.
Ready to learn more? Get your copy of The data doom loop: Why big corporations are failing at data management.