How to Run a Data Quality Audit in 2026

How to Run a Data Quality Audit That Actually Changes Something

A RevOps lead we know pulled a pipeline report for the board last quarter. It showed 847 open opportunities worth $14.2M. The problem? 312 of those were duplicates, 89 had no associated contact, and the "pipeline" was inflated by 40%. The board didn't ask about quota attainment that day. They asked why nobody caught this sooner.

That's what happens without a data quality audit. And it's far more common than anyone admits - 43% of COOs now rank data quality as their top data priority, over a quarter of organizations estimate they lose more than $5M annually to bad data, and 7% report losses exceeding $25M. With 45% of business leaders citing data quality as the leading barrier to scaling AI initiatives, audits aren't just about clean dashboards anymore. They're prerequisites for any AI strategy.

But here's the thing: most audits fail because they produce a PDF that nobody acts on. Let's fix that.

What You Need (Quick Version)

Your first audit needs exactly three deliverables:

10 SQL queries against your most business-critical dataset - nulls, duplicates, freshness, referential integrity, format validation
A pass/fail spreadsheet with actual numbers and thresholds (not "some emails are missing" - write "15.3% null rate on email field, threshold is 5%, FAIL")
A remediation plan with named owners, deadlines, and success criteria

Everything else is optimization. 56% of data teams cite data quality as their most pressing challenge, but most of them are stuck in the "we should really do something about this" phase. These three artifacts get you out of that phase in a week.

What a Data Quality Audit Actually Is

A data quality audit is a structured evaluation of a specific dataset against defined standards, with pass/fail judgments and a remediation plan attached. It's not a dashboard. It's not a vibe check. It's a repeatable process that produces evidence and assigns accountability.

[Figure: Six core data quality dimensions with definitions and failure examples]

People confuse three related but different things:

	Audit	Assessment	Governance
Purpose	Test specific data, fix failures	Evaluate maturity/readiness	Ongoing policy & ownership
Output	Pass/fail results + remediation	Maturity score + roadmap	Standards, roles, processes
Cadence	Quarterly + automated daily	Annual or semi-annual	Continuous
Who owns it	Data/analytics engineer	Data leadership	Cross-functional

A standard audit framework measures data against six core dimensions:

Dimension	Definition	Failure Example
Accuracy	Data reflects reality	Revenue field says $50K; Stripe says $48.2K
Completeness	Required fields are populated	15% of contacts missing email
Consistency	Same entity, same value everywhere	"Acme Inc" in CRM, "ACME" in billing
Timeliness	Data is fresh enough for its use	Lead scores based on 6-month-old firmographics
Validity	Values conform to format/range rules	Phone field contains "TBD"
Uniqueness	No unintended duplicates	312 duplicate opportunities inflating pipeline

How to Scope Your B2B Data Audit

Don't audit your entire data warehouse at once. That's how audit programs die in week two.

Start with one dataset - the one closest to revenue. Categorize your datasets into three buckets:

Customer-facing: data that end users or customers see (dashboards, reports, product data)
Operational: data powering real-time or near-real-time processes (lead routing, billing, scoring)
Analytical: data feeding internal analysis and models

Customer-facing and operational datasets come first. These are where bad data causes visible damage - bounced emails, misrouted leads, wrong invoices, inflated pipeline numbers. A wrong internal chart is less urgent than a wrong invoice.

For your first audit, pick one table or one data domain. Run the full process end-to-end. Get a remediation win. Then expand scope. We've watched teams try to boil the ocean on audit one, and the result is always the same: a 47-page report that nobody reads and nothing changes.

Step-by-Step Audit Process

Define Metrics and Thresholds

Before you write a single query, decide what "good" looks like. Every check needs a metric and a threshold - otherwise you're just profiling, not auditing.

[Figure: End-to-end data quality audit process in five steps]

Check	Metric	Threshold	Example
Null rate	% null per column	< 5% for required fields	Email null rate: 2.3% PASS
Duplicate rate	% duplicate rows	< 1%	Contact dupes: 3.7% FAIL
Freshness	Hours since last update	< 24 hours	Last sync: 47 hrs ago FAIL
Reconciliation	Source vs. target match	> 99%	497/500 matched (99.4%) PASS
Format validity	% matching regex	> 98%	Phone format: 96.1% FAIL

The key principle: document with numbers. "15.3% null rate on email field" is actionable. "Some emails are missing" is not. Every finding should be a number compared to a threshold with a pass or fail verdict.
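
To make the "number compared to a threshold" rule concrete, here is a minimal Python sketch that turns the example rows from the table above into verdicts. The `Check` class and its field names are illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass
import operator

# Comparison operators allowed in a threshold definition.
_OPS = {"<": operator.lt, ">": operator.gt}


@dataclass
class Check:
    """One audit check: a measured value compared to a threshold (hypothetical helper)."""
    name: str
    value: float      # measured metric
    op: str           # "<" or ">"
    threshold: float  # limit the metric is compared against

    def status(self) -> str:
        """Return PASS when the measured value satisfies the threshold."""
        return "PASS" if _OPS[self.op](self.value, self.threshold) else "FAIL"


# Example values copied from the thresholds table above.
checks = [
    Check("Email null rate (%)", 2.3, "<", 5.0),
    Check("Contact duplicate rate (%)", 3.7, "<", 1.0),
    Check("Hours since last sync", 47, "<", 24),
    Check("Reconciliation match (%)", 100.0 * 497 / 500, ">", 99.0),
    Check("Phone format valid (%)", 96.1, ">", 98.0),
]

results = {c.name: c.status() for c in checks}
for c in checks:
    print(f"{c.name}: {c.value:.1f} (threshold {c.op} {c.threshold}) {c.status()}")
```

Keeping the operator next to the threshold matters because some metrics are "lower is better" (null rate) and others are "higher is better" (match rate).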

Profile and Test Your Data

Here's where SQL does the heavy lifting. These queries cover common audit checks, adapted from patterns Dremio documents well.

NULL rate (single column):

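-- Share of rows where email is NULL.
-- NULLIF avoids a division-by-zero error on an empty table.
SELECT
  COUNT(*) AS total_rows,
  COUNT(email) AS non_null_rows,
  COUNT(*) - COUNT(email) AS null_rows,
  ROUND(100.0 * (COUNT(*) - COUNT(email)) / NULLIF(COUNT(*), 0), 2) AS null_pct
FROM contacts;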
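
If you want the same null-rate number for every required column in one pass, a small loop is enough. The sketch below uses Python's built-in sqlite3 module with made-up sample rows; the `contacts` columns and the `REQUIRED_COLUMNS` list are assumptions for illustration, and you would swap in your own warehouse connection.

```python
import sqlite3

# In-memory sample data standing in for a real contacts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts (email, phone) VALUES (?, ?)",
    [
        ("a@example.com", "555-0100"),
        (None, "555-0101"),
        ("c@example.com", None),
        (None, "555-0103"),
    ],
)

REQUIRED_COLUMNS = ["email", "phone"]  # hypothetical list of required fields


def null_pct(conn, table, column):
    """Percentage of rows in `table` where `column` is NULL."""
    # Identifiers cannot be bound as query parameters,
    # so only pass trusted, hard-coded names here.
    sql = (
        f"SELECT ROUND(100.0 * (COUNT(*) - COUNT({column})) "
        f"/ NULLIF(COUNT(*), 0), 2) FROM {table}"
    )
    return conn.execute(sql).fetchone()[0]


null_rates = {col: null_pct(conn, "contacts", col) for col in REQUIRED_COLUMNS}
for col, pct in null_rates.items():
    print(f"{col}: {pct}% null")
```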

Duplicate detection:

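-- Emails that appear on more than one contact row.
-- NULLs are excluded so missing emails are not reported as one large duplicate group.
SELECT email, COUNT(*) AS dupes
FROM contacts
WHERE email IS NOT NULL
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY dupes DESC;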
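
The query above lists the offending values, but the threshold is expressed as a percentage. One way to get a rate is to count "surplus" rows, meaning every row beyond the first in each group of matching emails. The sketch below assumes that definition, and also assumes emails should match case-insensitively after trimming whitespace; both are choices you may want to adjust. It runs against in-memory SQLite sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO contacts (email) VALUES (?)",
    [
        ("a@example.com",),
        ("A@Example.com ",),  # same address, different case and trailing space
        ("b@example.com",),
        ("c@example.com",),
        (None,),
        (None,),
    ],
)

# Surplus rows = non-null rows minus distinct normalized emails.
# NULL emails are excluded; they belong to the completeness check instead.
DUPLICATE_RATE_SQL = """
SELECT
  COUNT(*) AS non_null_rows,
  COUNT(*) - COUNT(DISTINCT LOWER(TRIM(email))) AS surplus_rows
FROM contacts
WHERE email IS NOT NULL
"""

non_null_rows, surplus_rows = conn.execute(DUPLICATE_RATE_SQL).fetchone()
# Guard against an empty table before dividing.
duplicate_pct = round(100.0 * surplus_rows / non_null_rows, 2) if non_null_rows else 0.0
print(f"{surplus_rows} surplus rows out of {non_null_rows} ({duplicate_pct}%)")
```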

Foreign key integrity:

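-- Opportunities whose account_id has no matching row in accounts.
-- NOT EXISTS is used instead of NOT IN, which returns no rows at all
-- if accounts.id ever contains a NULL.
SELECT o.account_id
FROM opportunities o
WHERE o.account_id IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM accounts a WHERE a.id = o.account_id);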

Range validation:

SELECT id, annual_revenue
FROM companies
WHERE annual_revenue < 0 OR annual_revenue > 1e12;

Regex format check (Postgres):

SELECT id, phone
FROM contacts
WHERE phone !~ '^\+?[0-9\-\(\) ]{7,20}$';
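
The query returns the failing rows; the audit metric is the share of values that pass. As a rough sketch, the same pattern can be applied in Python to compute that percentage. The sample phone values are invented, and the sketch assumes empty values are handled by the completeness check rather than counted here.

```python
import re

# Same pattern as the Postgres check above.
PHONE_RE = re.compile(r"^\+?[0-9\-\(\) ]{7,20}$")


def valid_pct(values):
    """Percentage of non-empty values that fully match the phone pattern."""
    present = [v for v in values if v]  # skip None and empty strings
    if not present:
        return 0.0
    valid = sum(1 for v in present if PHONE_RE.fullmatch(v))
    return round(100.0 * valid / len(present), 2)


phones = ["+1 (555) 010-2000", "555-0100", "TBD", "12345", None]
phone_valid_pct = valid_pct(phones)
print(f"Phone format: {phone_valid_pct}% valid")
```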

For deeper profiling - min/max/avg/stdev, distinct counts, string length distributions - Dataedo's query library is a solid reference. If you're in Python, Pandas and Polars equivalents exist for every one of these checks.
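
Freshness appears in the thresholds table but has no query above. A minimal sketch, assuming the table has an `updated_at` column stored as ISO 8601 text in UTC (an assumption; your schema may use a native timestamp type), looks like this in Python with in-memory SQLite sample data:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.executemany(
    "INSERT INTO contacts (updated_at) VALUES (?)",
    [("2026-01-10T08:00:00+00:00",), ("2026-01-11T09:00:00+00:00",)],
)


def hours_since_last_update(conn, now):
    """Hours between `now` and the newest updated_at value in contacts."""
    # MAX works on text here because every value uses the same ISO 8601 format.
    (latest,) = conn.execute("SELECT MAX(updated_at) FROM contacts").fetchone()
    age = now - datetime.fromisoformat(latest)
    return age.total_seconds() / 3600


# A fixed "now" keeps the example reproducible; use datetime.now(timezone.utc) in practice.
now = datetime(2026, 1, 13, 8, 0, tzinfo=timezone.utc)
age_hours = hours_since_last_update(conn, now)
status = "PASS" if age_hours < 24 else "FAIL"
print(f"Last sync: {age_hours:.0f} hrs ago {status}")
```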

Document Findings

Every audit needs a findings table. Here's what a filled example looks like:

Check	Metric	Threshold	Result	Status
Email null rate	% null	< 5%	2.3%	PASS
Contact duplicates	% duplicate	< 1%	3.7%	FAIL
Revenue reconciliation	Source match %	> 99%	99.4%	PASS
Revenue accuracy	Variance vs Stripe	< 0.1%	0.02%	PASS
Data freshness	Hours since sync	< 24 hrs	47 hrs	FAIL
Phone format	% valid format	> 98%	96.1%	FAIL

Track three operational KPIs over time to measure whether your audit program is actually improving things: number of incidents (N), time to detection (TTD), and time to resolution (TTR). If N goes down and both TTD and TTR shrink quarter over quarter, your audits are working.
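
Here is one way these KPIs could be computed from a simple incident log. The sketch assumes TTD is measured from when the problem occurred to when it was detected, and TTR from detection to resolution; the field names and sample incidents are made up for illustration.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log for one quarter.
incidents = [
    {
        "occurred": datetime(2026, 1, 5, 9, 0),
        "detected": datetime(2026, 1, 5, 15, 0),
        "resolved": datetime(2026, 1, 6, 15, 0),
    },
    {
        "occurred": datetime(2026, 2, 1, 0, 0),
        "detected": datetime(2026, 2, 1, 2, 0),
        "resolved": datetime(2026, 2, 1, 10, 0),
    },
]


def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600


n = len(incidents)
mean_ttd_hours = mean(hours(i["detected"] - i["occurred"]) for i in incidents)
mean_ttr_hours = mean(hours(i["resolved"] - i["detected"]) for i in incidents)

print(f"N={n}, mean TTD={mean_ttd_hours:.1f}h, mean TTR={mean_ttr_hours:.1f}h")
```

Comparing these three numbers quarter over quarter is what shows the trend; a single quarter's values say little on their own.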

Your audit just flagged a 15% email null rate and 3.7% duplicates. Now what? Prospeo's 300M+ profiles with 98% verified email accuracy and 7-day refresh cycles mean fewer nulls, fewer bounces, and fewer audit failures. Enrich your CRM in bulk - 83% of leads come back with verified contact data and 50+ data points.

Pass your next data quality audit before you even run it.


How to Audit CRM and GTM Data

CRM and go-to-market data has its own failure modes that generic frameworks miss entirely.

[Figure: CRM data quality failure modes with impact severity]

Bounce rates are the canary in the coal mine. If your SDRs are running sequences with a 12% email bounce rate, you're not just losing replies - you're burning domain reputation. Every bounced email tells ESPs your list is dirty, and once your domain gets flagged, even your good emails stop landing.

Stale contacts rot faster than most teams realize. If your CRM hasn't been enriched recently, a meaningful chunk of your contact data points to the wrong company, wrong title, or a deactivated inbox. The consensus on r/sales is that B2B contact data decays at roughly 30% per year, and in our experience that's conservative for fast-moving industries like tech.

Duplicate opportunities inflating pipeline is the scenario from the intro, where 312 of 847 open opportunities were duplicates. Merge rules that don't catch company name variants ("Acme Inc" vs "Acme, Inc." vs "ACME") create phantom pipeline that misleads the entire organization.
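
A merge rule that catches those name variants can start as a simple normalization key. The sketch below is a simplified illustration, not a full fuzzy-matching approach: it lowercases the name, strips punctuation, and drops a short list of legal suffixes. The suffix list is an assumption and would need extending for real data.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately short list of legal suffixes to ignore.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}


def normalize_company(name: str) -> str:
    """Build a comparison key: lowercase, no punctuation, no legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    # Note: a company whose entire name is a suffix (e.g. "Co") yields an empty key.
    return " ".join(tokens)


names = ["Acme Inc", "Acme, Inc.", "ACME", "Globex Corp"]

# Group original names by their normalized key.
groups = defaultdict(list)
for name in names:
    groups[normalize_company(name)].append(name)

for key, variants in groups.items():
    if len(variants) > 1:
        print(f"Possible duplicates for '{key}': {variants}")
```

Treat matches as candidates for human review rather than automatic merges, since distinct companies can share a normalized name.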

Misrouted leads and broken attribution follow from stale territory fields and wrong industry codes. Leads route to the wrong rep, marketing can't measure what's working, and everyone blames each other.

When the audit reveals stale emails and dead phone numbers - and it will - the fix is re-verification and enrichment. Prospeo handles this at scale with 98% email accuracy, a 7-day data refresh cycle versus the 6-week industry average, and CRM enrichment that returns 50+ data points per contact. That's the difference between knowing your data is bad and actually fixing it.

Build Your Remediation Plan

An audit without a remediation plan is just audit theater. A PDF that makes everyone feel productive while nothing changes.

[Figure: Remediation plan template with required fields and structure]

Your remediation plan needs these fields for every failing check:

Field	Example
Issue	Contact duplicate rate 3.7% (threshold: <1%)
Root cause	No dedup rule on web form imports
Downstream impact	Inflated pipeline by ~$2.1M
Action	Implement fuzzy match dedup on import
Owner	Sarah Chen, RevOps
Deadline	2026-03-15
Success criteria	Duplicate rate < 1% for 30 days
Verification	Re-run duplicate query on Apr 15

The structure matters: executive summary, scope and sources, dimensions with metrics and thresholds, findings with root causes, and recommendations with an action plan that has named owners and deadlines. Skip any of those elements and the report collects dust.
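
If the remediation plan lives in code or a tracker export rather than a spreadsheet, the "no missing owner, no missing deadline" rule can be enforced mechanically. This is a minimal sketch using the fields from the table above; the `RemediationItem` class and its `missing_fields` method are hypothetical names for illustration.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class RemediationItem:
    """One failing check and the plan to fix it (hypothetical structure)."""
    issue: str
    root_cause: str
    downstream_impact: str
    action: str
    owner: str
    deadline: date
    success_criteria: str
    verification: str

    def missing_fields(self) -> list[str]:
        """Names of fields left empty, so incomplete items can be rejected."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


complete = RemediationItem(
    issue="Contact duplicate rate 3.7% (threshold: <1%)",
    root_cause="No dedup rule on web form imports",
    downstream_impact="Inflated pipeline by ~$2.1M",
    action="Implement fuzzy match dedup on import",
    owner="Sarah Chen, RevOps",
    deadline=date(2026, 3, 15),
    success_criteria="Duplicate rate < 1% for 30 days",
    verification="Re-run duplicate query on Apr 15",
)

# Same item with the owner left blank: this should be flagged.
unowned = RemediationItem(**{**complete.__dict__, "owner": ""})

print(complete.missing_fields())
print(unowned.missing_fields())
```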

Automating Audits in Your Pipeline

Manual audits are necessary for the first pass. But if you're still running SQL queries by hand every quarter, you're leaving gaps between audits where data rots undetected.

[Figure: Automated data quality audit pipeline architecture diagram]

dbt-expectations

dbt's native tests cover the basics - unique, not_null, accepted_values, relationships. That's four checks. You need more.

dbt-expectations is a free, open-source package that extends dbt with Great Expectations-style assertions you define in YAML. Thousands of teams use it in production.

Freshness check:

- dbt_expectations.expect_grouped_row_values_to_have_recent_data:
    group_by: [source_system]
    timestamp_column: updated_at
    datepart: hour
    interval: 24

Completeness across time buckets:

- dbt_expectations.expect_row_values_to_have_data_for_every_n_datepart:
    date_col: created_date
    date_part: day
    interval: 1
    test_start_date: "2026-01-01"
    test_end_date: "2026-03-01"

These run on every dbt build, which means your audit checks execute every time your pipeline runs. Issues surface in hours, not quarters.

Great Expectations + Airflow

For teams not on dbt, Great Expectations is the standard alternative. Install with pip install great_expectations, initialize with gx.get_context(mode="file") to create your project structure, then define expectation suites as JSON - each suite is a collection of checks against a specific dataset. Integrate with Airflow via airflow-provider-great-expectations (requires Python 3.10+, GX 1.7.0+, Airflow 2.1+). Suites run as Airflow tasks, failing the DAG when checks fail.

This is the right path for teams running Spark, Airflow, or custom Python pipelines where dbt isn't in the stack.

Audit Tools Compared

Let's be honest: most teams don't need a dedicated data quality tool. Start with SQL and dbt tests. Prove the process manually. Then decide if you need software.

That said, here's the field when you outgrow the basics:

Tool	Type	AI/ML	Best For	Est. Pricing
Great Expectations	OSS / Code	No	Python/Airflow teams	Free; Cloud ~$1-5K/mo
dbt-expectations	OSS / Code	No	dbt shops	Free; dbt Cloud $100+/mo
Prospeo	Proprietary / No-Code	No	B2B contact verification & enrichment	Free tier; ~$0.01/email
Soda Core	OSS / Partial No-Code	No	YAML-first checks	Free; Cloud ~$500/mo
Deequ	OSS / Code	No	Spark-based pipelines	Free (AWS/Spark)
Datafold	Proprietary / Partial	No	Diff testing in CI	~$500/mo+
Monte Carlo	Proprietary / No-Code	Yes	Enterprise (100+ tables)	~$50-150K+/yr
Anomalo	Proprietary / No-Code	Yes	Auto-detect anomalies at scale	~$30-100K+/yr
Bigeye	Proprietary / No-Code	Yes	Mid-market monitoring	~$30-80K+/yr
Collibra	Proprietary / No-Code	Yes	Governance + quality combined	~$50-200K+/yr

General observability tools like Monte Carlo and Anomalo are excellent at catching schema changes, volume anomalies, and distribution drift across your warehouse. But when your audit reveals that contact data is the problem - stale emails, invalid phone numbers, missing firmographics - those tools can't fix it. That's a different kind of quality issue, and it needs a different kind of tool.

Skip the enterprise platforms if your average deal size is under $50K and your warehouse has fewer than 50 tables. SQL plus dbt-expectations will cover you until you genuinely outgrow them.

Over a quarter of organizations estimate that bad data costs them $5M+ per year. Failed freshness checks often trace back to stale records from providers refreshing every 6 weeks. Prospeo refreshes all 300M+ profiles every 7 days with 5-step verification, catch-all handling, and spam-trap removal - so your pipeline numbers reflect reality, not ghosts.

Replace your biggest audit failure with data that stays accurate weekly.


Pitfalls That Kill Audit Programs

We've seen five patterns destroy audit programs before they produce a single remediation win.

Scope creep is the most common killer. Auditing everything at once means finishing nothing. One dataset, end-to-end, with a remediation win - then expand.

No ownership turns findings into suggestions. And suggestions get ignored. Every failing check needs a person and a deadline in the remediation plan. No exceptions.

Alert fatigue happens when teams set thresholds too tight, get 200 alerts a week, and start ignoring all of them. Set thresholds at levels that represent actual business risk, not theoretical perfection. If your null rate threshold is 0.1% on a non-critical field, you're going to drown in noise.

Audit theater is the quarterly report that gets produced, presented, filed, and forgotten. If your audit doesn't change a process, a pipeline, or a dataset within 30 days of completion, it's theater. Once your process is repeatable, formalize maturity levels - Initial, Repeatable, Optimized - so leadership can track progression and hold teams accountable.

Tool over-investment is buying a $100K observability platform before you've run your first SQL-based audit. That's like buying a race car before you have a driver's license. Start with SQL. Graduate to dbt tests. Buy tooling when you've outgrown the basics.

FAQ

How often should you run a data quality audit?

Run a full manual audit quarterly on your most critical datasets. Between audits, automate daily checks for nulls, duplicates, and freshness using dbt-expectations or Great Expectations. The manual audit catches structural issues and new failure modes; the automated checks catch day-to-day drift.

What's the difference between an audit and data profiling?

Data profiling is the discovery step - understanding column distributions, null rates, and value ranges. An audit includes profiling but adds thresholds, pass/fail judgments, root cause analysis, and a remediation plan with named owners. Profiling tells you what your data looks like. An audit tells you whether it's good enough and what to fix.

How do you audit B2B contact data in a CRM?

Check email bounce rates, phone number validity, duplicate contact records, and freshness - when was each record last verified? Also examine firmographic accuracy and job title currency, since those fields drive lead scoring and routing. For records that fail, re-verify through a platform with a short refresh cycle so you're replacing stale data with genuinely current records, not slightly-less-stale ones.

What tools do I need for my first audit?

A SQL client and a spreadsheet. Run profiling and validation queries against your database, document pass/fail results with thresholds, and build a remediation plan with owners and deadlines. Add dbt-expectations or Great Expectations when you're ready to automate. Don't buy enterprise tooling until you've proven the process works manually.

Who should own the audit framework in a GTM org?

RevOps is the natural owner for go-to-market data audits because they sit at the intersection of sales, marketing, and customer success systems. They have the cross-functional visibility to define thresholds that reflect actual business impact and the authority to enforce remediation deadlines across teams.