Data Quality by Design
Why Data Management and Golden Records Are the Only Sustainable Foundation
Data quality is often treated as a downstream problem.
Something to fix after ingestion, after integration, after the damage is already visible.
In practice, this approach never scales.
According to Gartner, data quality refers to the usability and applicability of data for an organization’s priority use cases. When quality is addressed too late, data-driven initiatives struggle not because of missing tools, but because the underlying data cannot be trusted, explained, or governed.
Real data quality does not emerge from cleansing jobs or validation scripts added late in the pipeline.
It emerges from data management decisions, and it becomes measurable only when a Golden Record exists.
The Golden Record is not the goal.
It is the condition that makes data quality possible.
Data Quality Without a Golden Record Is Fragmented by Definition
In the absence of a Golden Record:
- Each source system enforces its own notion of validity
- The same attribute exists in multiple, incompatible formats
- Conflicts between sources are resolved implicitly, or not at all
- Corrections applied downstream never propagate upstream
The result is predictable: data quality becomes subjective, temporary, and impossible to audit.
A Golden Record changes the nature of the problem.
It introduces a stable point of reference where data quality can be evaluated consistently.
It introduces:
- Identity: which entity are we actually describing?
- Consolidation: which value wins, and according to which rule?
- Accountability: when was a decision made, and on which evidence?
Only at this point does data quality become:
- measurable
- versioned
- explainable
- improvable over time
Data Quality Is Contextual, Not Absolute
One of the most common mistakes in data quality initiatives is attempting to “improve everything”.
Gartner explicitly warns against this approach.
Not all data has the same business value or risk profile. Data quality efforts must be scoped around priority use cases.
This is where the Golden Record becomes strategic.
By centralizing the most critical entities, organizations can:
- focus quality controls where they matter most
- align data quality with business and regulatory risk
- avoid dispersing effort across low-impact datasets
Data quality is not about perfection.
It is about fitness for purpose, enforced consistently.
The Data Quality Dimensions That Matter in Practice
Many theoretical frameworks list a long set of dimensions.
Operational systems need fewer, but enforced rigorously.
In a Golden Record and MDM context, the dimensions that matter most are:
- Completeness: is essential information missing?
- Accuracy: are values plausible and verifiable?
- Consistency: are values coherent across fields and systems?
- Uniqueness: is duplication controlled through identity resolution?
- Timeliness: does the data reflect the current state of reality?
Some frameworks also include accessibility and relevancy.
In practice, these are often outcomes of good data management rather than primary controls.
What matters is that dimensions are:
- explicitly defined
- measurable
- tied to executable rules
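As a minimal sketch of what "measurable and tied to executable rules" means in practice, the five dimensions can be rolled up into a single score. The dimension names come from the list above; the weights are hypothetical and would normally be tuned per entity type and use case.

```python
# Hypothetical weights: in a real MDM deployment these are tuned
# per entity type and priority use case, not hardcoded.
DIMENSION_WEIGHTS = {
    "completeness": 0.30,
    "accuracy": 0.25,
    "consistency": 0.20,
    "uniqueness": 0.15,
    "timeliness": 0.10,
}

def overall_score(dimensions):
    """Weighted average of per-dimension scores, ignoring unknown keys."""
    total = sum(DIMENSION_WEIGHTS[d] for d in dimensions if d in DIMENSION_WEIGHTS)
    if total == 0:
        return 0.0
    weighted = sum(
        score * DIMENSION_WEIGHTS[d]
        for d, score in dimensions.items()
        if d in DIMENSION_WEIGHTS
    )
    return round(weighted / total, 3)

print(overall_score({
    "completeness": 0.95, "accuracy": 0.90, "consistency": 0.93,
    "uniqueness": 0.98, "timeliness": 0.85,
}))
```

The point is not the formula itself but that it is explicit, versioned alongside the ruleset, and reproducible for any record.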
A Data Model That Treats Quality as a First-Class Citizen
A common anti-pattern is calculating data quality externally and discarding the evidence.
A sustainable architecture embeds quality inside the Golden Record itself.
The model must retain:
- the consolidated data
- the identity resolution context
- the quality evaluation
- the ruleset used
- the audit trail
Example: Golden Record with Embedded Data Quality
{
  "_id": "gr:person:8f2a1c7e",
  "entityType": "person",
  "golden": {
    "firstName": "Mario",
    "lastName": "Noioso",
    "birthDate": "1980-05-10",
    "taxId": "NSSMRA80E10H501Z",
    "emails": ["mario@example.com"],
    "phones": ["+393497726264"],
    "address": {
      "street": "Via Luigi Gallo 15",
      "city": "Rome",
      "postalCode": "001xx",
      "country": "IT"
    }
  },
  "identity": {
    "clusterId": "clu:3b91f1",
    "sourceIds": [
      { "system": "crm", "id": "CRM-192833" },
      { "system": "billing", "id": "BILL-88211" }
    ]
  },
  "dataQuality": {
    "overallScore": 0.92,
    "grade": "A",
    "dimensions": {
      "completeness": 0.95,
      "accuracy": 0.90,
      "consistency": 0.93,
      "uniqueness": 0.98,
      "timeliness": 0.85
    },
    "issues": [
      {
        "ruleId": "DQ-CONS-IT-POSTAL",
        "severity": "medium",
        "field": "golden.address.postalCode",
        "description": "Postal code not validated against national reference dataset"
      }
    ],
    "evaluatedAt": "2026-02-02T08:10:00Z",
    "rulesetVersion": "dq-rules-v3.4"
  },
  "versioning": {
    "version": 17,
    "updatedAt": "2026-02-02T08:10:05Z"
  }
}
This structure makes data quality:
- queryable
- explainable
- historically traceable
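"Queryable" is concrete: because the quality metadata lives inside each Golden Record, finding records that need attention is an ordinary filter, not a separate report. A sketch over an in-memory list (the second record and the threshold are invented for illustration; in a document store this would be the equivalent filter on `dataQuality.overallScore` and `dataQuality.issues`):

```python
# Two records shaped like the example above; the second one is hypothetical.
records = [
    {"_id": "gr:person:8f2a1c7e",
     "dataQuality": {"overallScore": 0.92,
                     "issues": [{"ruleId": "DQ-CONS-IT-POSTAL", "severity": "medium"}]}},
    {"_id": "gr:person:11aa22bb",
     "dataQuality": {"overallScore": 0.71, "issues": []}},
]

def needs_review(record, threshold=0.80):
    """Flag records below the score threshold or carrying a high-severity issue."""
    dq = record["dataQuality"]
    return dq["overallScore"] < threshold or any(
        issue.get("severity") == "high" for issue in dq["issues"]
    )

flagged = [r["_id"] for r in records if needs_review(r)]
print(flagged)  # only the low-scoring record
```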
From Conceptual Controls to Executable Rules
Data quality only becomes operational when conceptual controls are translated into rules.
Completeness Control
{
  "ruleId": "DQ-COMP-IDENTITY",
  "dimension": "completeness",
  "description": "At least one strong identifier must be present",
  "predicate": {
    "anyOf": [
      { "exists": "golden.taxId" },
      { "exists": "golden.nationalId" }
    ]
  },
  "scoreImpact": -0.15
}
Accuracy Control (Email Example)
{
  "ruleId": "DQ-ACC-EMAIL",
  "dimension": "accuracy",
  "description": "Email must be syntactically valid",
  "predicate": {
    "regex": {
      "field": "golden.emails[*]",
      "pattern": "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$"
    }
  },
  "scoreImpact": -0.05
}
Consistency Control
{
  "ruleId": "DQ-CONS-IT-POSTAL",
  "dimension": "consistency",
  "description": "Italian addresses must have a valid postal code",
  "predicate": {
    "if": { "equals": { "field": "golden.address.country", "value": "IT" } },
    "then": {
      "regex": {
        "field": "golden.address.postalCode",
        "pattern": "^\\d{5}$"
      }
    }
  },
  "scoreImpact": -0.06
}
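Rules in this shape only matter if something executes them. A minimal interpreter for the predicates used above (a sketch, not a full rule engine: it supports `exists`, `regex`, `anyOf`, and `if`/`then` with `equals`, over dotted field paths with `[*]` for arrays):

```python
import re

def resolve(record, path):
    """Resolve a dotted path like 'golden.address.postalCode'.
    A segment ending in '[*]' fans out over every element of a list field."""
    values = [record]
    for part in path.split("."):
        many = part.endswith("[*]")
        key = part[:-3] if many else part
        nxt = []
        for v in values:
            if isinstance(v, dict) and key in v:
                child = v[key]
                nxt.extend(child if many and isinstance(child, list) else [child])
        values = nxt
    return values

def evaluate(predicate, record):
    """Return True when the record satisfies the predicate."""
    if "exists" in predicate:
        return len(resolve(record, predicate["exists"])) > 0
    if "regex" in predicate:
        r = predicate["regex"]
        vals = resolve(record, r["field"])
        return bool(vals) and all(re.match(r["pattern"], str(v)) for v in vals)
    if "anyOf" in predicate:
        return any(evaluate(p, record) for p in predicate["anyOf"])
    if "equals" in predicate:
        e = predicate["equals"]
        return resolve(record, e["field"]) == [e["value"]]
    if "if" in predicate:
        # A conditional rule passes vacuously when its condition does not hold.
        if not evaluate(predicate["if"], record):
            return True
        return evaluate(predicate["then"], record)
    raise ValueError(f"unknown predicate: {predicate}")

postal_rule = {
    "if": {"equals": {"field": "golden.address.country", "value": "IT"}},
    "then": {"regex": {"field": "golden.address.postalCode", "pattern": "^\\d{5}$"}},
}
record = {"golden": {"address": {"country": "IT", "postalCode": "001xx"}}}
print(evaluate(postal_rule, record))  # False: "001xx" is not a 5-digit code
```

A failing predicate is then translated into an issue entry and the rule's `scoreImpact`, exactly as shown in the Golden Record example.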
Identity Resolution as a Core Data Quality Mechanism
Uniqueness is not achieved through batch deduplication jobs.
It is achieved through identity resolution with explicit thresholds.
{
  "ruleId": "DQ-UNIQ-MATCH",
  "dimension": "uniqueness",
  "matchModel": {
    "blockingKeys": [
      ["golden.taxId"],
      ["golden.lastName", "golden.birthDate"]
    ],
    "weights": {
      "taxIdExact": 0.7,
      "nameSimilarity": 0.2,
      "birthDateExact": 0.1
    },
    "thresholds": {
      "autoMerge": 0.92,
      "manualReview": 0.80
    }
  }
}
Here, data quality directly governs merge behavior.
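The match model above can be sketched as a scoring function plus a threshold decision. The weights and thresholds are taken from the configuration; the similarity function is a deliberately crude placeholder (a real system would use trained or fuzzy comparators, not exact matching):

```python
def name_similarity(a, b):
    """Placeholder comparator: exact match (case-insensitive) or nothing."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

def match_score(left, right, weights):
    """Score a candidate pair with the declared weights."""
    score = 0.0
    if left.get("taxId") and left.get("taxId") == right.get("taxId"):
        score += weights["taxIdExact"]
    score += weights["nameSimilarity"] * name_similarity(
        left.get("lastName", ""), right.get("lastName", ""))
    if left.get("birthDate") and left.get("birthDate") == right.get("birthDate"):
        score += weights["birthDateExact"]
    return round(score, 2)

def decide(score, thresholds):
    """Map a match score onto the explicit thresholds."""
    if score >= thresholds["autoMerge"]:
        return "auto_merge"
    if score >= thresholds["manualReview"]:
        return "manual_review"
    return "no_match"

weights = {"taxIdExact": 0.7, "nameSimilarity": 0.2, "birthDateExact": 0.1}
thresholds = {"autoMerge": 0.92, "manualReview": 0.80}
a = {"taxId": "NSSMRA80E10H501Z", "lastName": "Noioso", "birthDate": "1980-05-10"}
b = {"taxId": "NSSMRA80E10H501Z", "lastName": "NOIOSO", "birthDate": "1980-05-10"}
print(decide(match_score(a, b, weights), thresholds))  # auto_merge
```

Because the thresholds are explicit configuration rather than buried code, the merge behavior itself becomes auditable.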
Profiling, Evaluation, and Explainability
Before rules are enforced, data must be understood.
Gartner highlights data profiling as a foundational step to:
- identify anomalies
- reveal hidden patterns
- expose structural inconsistencies
Profiling feeds quality evaluation, which produces events.
{
  "eventType": "dataquality.evaluated",
  "entityId": "person:8f2a1c7e",
  "rulesetVersion": "dq-rules-v3.4",
  "score": 0.92,
  "issues": [
    {
      "ruleId": "DQ-CONS-IT-POSTAL",
      "severity": "medium"
    }
  ],
  "evaluatedAt": "2026-02-02T08:10:00Z"
}
These events enable:
- traceability
- replay
- re-scoring with new rules
- regulatory audit
Measuring Data Quality Over Time
Data quality is not static: it drifts.
Organizations that do not measure quality cannot improve it.
A minimal KPI document looks like this:
{
  "date": "2026-02-02",
  "entityType": "person",
  "metrics": {
    "recordsEvaluated": 182340,
    "averageScore": 0.889,
    "publishRate": 0.91,
    "reviewRate": 0.07,
    "duplicateRate": 0.014
  },
  "topFailingRules": [
    { "ruleId": "DQ-CONS-IT-POSTAL", "count": 11230 },
    { "ruleId": "DQ-ACC-EMAIL", "count": 8840 }
  ]
}
This is where data governance becomes operational.
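A KPI document like the one above can be derived directly from the stream of `dataquality.evaluated` events. A sketch, assuming hypothetical publish/review thresholds (0.85 and 0.70) that a real deployment would define per entity type:

```python
from collections import Counter

def daily_kpis(events, entity_type, publish_threshold=0.85, review_threshold=0.70):
    """Aggregate evaluation events into a daily KPI document."""
    scores = [e["score"] for e in events]
    failing = Counter(i["ruleId"] for e in events for i in e["issues"])
    return {
        "entityType": entity_type,
        "metrics": {
            "recordsEvaluated": len(events),
            "averageScore": round(sum(scores) / len(scores), 3),
            "publishRate": round(sum(s >= publish_threshold for s in scores) / len(scores), 2),
            "reviewRate": round(sum(review_threshold <= s < publish_threshold for s in scores) / len(scores), 2),
        },
        "topFailingRules": [
            {"ruleId": rule, "count": count} for rule, count in failing.most_common(3)
        ],
    }

# Three illustrative events in the shape shown earlier.
events = [
    {"score": 0.92, "issues": [{"ruleId": "DQ-CONS-IT-POSTAL"}]},
    {"score": 0.78, "issues": [{"ruleId": "DQ-ACC-EMAIL"}]},
    {"score": 0.60, "issues": [{"ruleId": "DQ-CONS-IT-POSTAL"}]},
]
kpis = daily_kpis(events, "person")
print(kpis["metrics"]["averageScore"])  # 0.767
```

Because the events carry the ruleset version, the same aggregation can be replayed after a ruleset change to see whether quality actually improved.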
Final Perspective
Data quality cannot be bolted on.
It emerges when:
- identity is explicit
- consolidation is deterministic
- rules are versioned
- decisions are explainable
The Golden Record is not the end of the journey.
It is the point where data quality stops being aspirational and becomes engineering.