How to Run a Data Quality Audit in 2026

Step-by-step data quality audit process with SQL queries, thresholds, remediation templates, and automation tips. Fix bad data, don't just report it.

9 min read · Prospeo Team

How to Run a Data Quality Audit That Actually Changes Something

A RevOps lead we know pulled a pipeline report for the board last quarter. It showed 847 open opportunities worth $14.2M. The problem? 312 of those were duplicates, 89 had no associated contact, and the "pipeline" was inflated by 40%. The board didn't ask about quota attainment that day. They asked why nobody caught this sooner.

That's what happens without a data quality audit. And it's far more common than anyone admits - 43% of COOs now rank data quality as their top data priority, over a quarter of organizations estimate they lose more than $5M annually to bad data, and 7% report losses exceeding $25M. With 45% of business leaders citing data quality as the leading barrier to scaling AI initiatives, audits aren't just about clean dashboards anymore. They're prerequisites for any AI strategy.

But here's the thing: most audits fail because they produce a PDF that nobody acts on. Let's fix that.

What You Need (Quick Version)

Your first audit needs exactly three deliverables:

  1. 10 SQL queries against your most business-critical dataset - nulls, duplicates, freshness, referential integrity, format validation
  2. A pass/fail spreadsheet with actual numbers and thresholds (not "some emails are missing" - write "15.3% null rate on email field, threshold is 5%, FAIL")
  3. A remediation plan with named owners, deadlines, and success criteria

Everything else is optimization. 56% of data teams cite data quality as their most pressing challenge, but most of them are stuck in the "we should really do something about this" phase. These three artifacts get you out of that phase in a week.

What a Data Quality Audit Actually Is

A data quality audit is a structured evaluation of a specific dataset against defined standards, with pass/fail judgments and a remediation plan attached. It's not a dashboard. It's not a vibe check. It's a repeatable process that produces evidence and assigns accountability.

[Figure: Six core data quality dimensions with definitions and failure examples]

People confuse three related but different things:

|  | Audit | Assessment | Governance |
|---|---|---|---|
| Purpose | Test specific data, fix failures | Evaluate maturity/readiness | Ongoing policy & ownership |
| Output | Pass/fail results + remediation | Maturity score + roadmap | Standards, roles, processes |
| Cadence | Quarterly + automated daily | Annual or semi-annual | Continuous |
| Who owns it | Data/analytics engineer | Data leadership | Cross-functional |

A standard audit framework measures data against six core dimensions:

| Dimension | Definition | Failure Example |
|---|---|---|
| Accuracy | Data reflects reality | Revenue field says $50K; Stripe says $48.2K |
| Completeness | Required fields are populated | 15% of contacts missing email |
| Consistency | Same entity, same value everywhere | "Acme Inc" in CRM, "ACME" in billing |
| Timeliness | Data is fresh enough for its use | Lead scores based on 6-month-old firmographics |
| Validity | Values conform to format/range rules | Phone field contains "TBD" |
| Uniqueness | No unintended duplicates | 312 duplicate opportunities inflating pipeline |

How to Scope Your B2B Data Audit

Don't audit your entire data warehouse at once. That's how audit programs die in week two.

Start with one dataset - the one closest to revenue. Categorize your datasets into three buckets:

  • Customer-facing: data that end users or customers see (dashboards, reports, product data)
  • Operational: data powering real-time or near-real-time processes (lead routing, billing, scoring)
  • Analytical: data feeding internal analysis and models

Customer-facing and operational datasets come first. These are where bad data causes visible damage - bounced emails, misrouted leads, wrong invoices, inflated pipeline numbers. A wrong internal chart is less urgent than a wrong invoice.

For your first audit, pick one table or one data domain. Run the full process end-to-end. Get a remediation win. Then expand scope. We've watched teams try to boil the ocean on audit one, and the result is always the same: a 47-page report that nobody reads and nothing changes.

Step-by-Step Audit Process

Define Metrics and Thresholds

Before you write a single query, decide what "good" looks like. Every check needs a metric and a threshold - otherwise you're just profiling, not auditing.

[Figure: End-to-end data quality audit process in five steps]

| Check | Metric | Threshold | Example |
|---|---|---|---|
| Null rate | % null per column | < 5% for required fields | Email null rate: 2.3% PASS |
| Duplicate rate | % duplicate rows | < 1% | Contact dupes: 3.7% FAIL |
| Freshness | Hours since last update | < 24 hours | Last sync: 47 hrs ago FAIL |
| Reconciliation | Source vs. target match | > 99% | 497/500 matched (99.4%) PASS |
| Format validity | % matching regex | > 98% | Phone format: 96.1% FAIL |

The key principle: document with numbers. "15.3% null rate on email field" is actionable. "Some emails are missing" is not. Every finding should be a number compared to a threshold with a pass or fail verdict.
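
The "number compared to a threshold" rule can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `Check` class and its field names are hypothetical, and the sample values are the ones from the threshold table above.

```python
from dataclasses import dataclass
import operator

# Comparison operators a threshold may use ("<" means lower is better).
_OPS = {"<": operator.lt, ">": operator.gt}


@dataclass
class Check:
    name: str         # e.g. "Email null rate"
    value: float      # measured metric
    op: str           # "<" or ">"
    threshold: float  # limit the metric is compared against
    unit: str = "%"

    def status(self) -> str:
        """PASS when the measured value satisfies the threshold, else FAIL."""
        return "PASS" if _OPS[self.op](self.value, self.threshold) else "FAIL"

    def finding(self) -> str:
        """One spreadsheet-ready line: metric, threshold, verdict."""
        return (f"{self.name}: {self.value}{self.unit}, "
                f"threshold {self.op} {self.threshold}{self.unit}, {self.status()}")


checks = [
    Check("Email null rate", 2.3, "<", 5),
    Check("Contact duplicate rate", 3.7, "<", 1),
    Check("Reconciliation match", 99.4, ">", 99),
    Check("Phone format validity", 96.1, ">", 98),
]
findings = [c.finding() for c in checks]
for line in findings:
    print(line)
```

Each printed line is a complete finding, so the list can be pasted straight into the pass/fail spreadsheet.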

Profile and Test Your Data

Here's where SQL does the heavy lifting. These queries cover common audit checks, adapted from patterns Dremio documents well.

NULL rate (single column):

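-- COUNT(email) skips NULLs, so its difference from COUNT(*) is the null count.
SELECT
  COUNT(*) AS total_rows,
  COUNT(email) AS non_null_rows,
  COUNT(*) - COUNT(email) AS null_rows,
  -- NULLIF avoids a division-by-zero error on an empty table.
  ROUND(100.0 * (COUNT(*) - COUNT(email)) / NULLIF(COUNT(*), 0), 2) AS null_pct
FROM contacts;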
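
To run the same null-rate check across every column at once, one option is a small Python loop over the table's columns. The sketch below uses an in-memory SQLite database purely so it is self-contained; the `contacts` sample rows and the `null_rates` helper are illustrative assumptions, and the column lookup (`PRAGMA table_info`) is SQLite-specific, so other databases would need their own catalog query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts (email, phone) VALUES (?, ?)",
    [("a@example.com", "+1 555 0100"), (None, "+1 555 0101"),
     ("c@example.com", None), (None, None)],
)


def null_rates(conn, table):
    """Return {column: null_pct} for every column in `table`.

    `table` is interpolated into SQL, so pass only trusted identifiers.
    """
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    (total,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    rates = {}
    for col in columns:
        (non_null,) = conn.execute(f'SELECT COUNT("{col}") FROM {table}').fetchone()
        rates[col] = round(100.0 * (total - non_null) / total, 2) if total else 0.0
    return rates


rates = null_rates(conn, "contacts")
print(rates)
```

The resulting dictionary maps directly onto one findings row per column.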

Duplicate detection:

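-- Normalize case and whitespace so "A@x.com" and "a@x.com " count as one address.
-- NULL emails are excluded; they belong to the null-rate check.
SELECT LOWER(TRIM(email)) AS email_key, COUNT(*) AS dupes
FROM contacts
WHERE email IS NOT NULL
GROUP BY LOWER(TRIM(email))
HAVING COUNT(*) > 1
ORDER BY dupes DESC;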
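
The query above lists the duplicated values; the threshold table needs a single percentage. One way to get it, sketched here with an in-memory SQLite table and made-up sample rows, is to count surplus rows (every row beyond the first per normalized email) against all non-null rows. That definition of "duplicate rate" is an assumption of this sketch; pick one definition and keep it stable between audits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO contacts (email) VALUES (?)",
    [("a@example.com",), ("A@Example.com",), ("b@example.com",),
     ("c@example.com",), (None,)],
)

# COUNT(email) ignores NULLs; COUNT(DISTINCT ...) counts unique normalized emails.
total, distinct = conn.execute("""
    SELECT COUNT(email), COUNT(DISTINCT LOWER(TRIM(email)))
    FROM contacts
""").fetchone()

# Surplus rows are the ones a dedup pass would remove.
surplus = total - distinct
duplicate_pct = round(100.0 * surplus / total, 2) if total else 0.0
print(f"{surplus} surplus rows out of {total} ({duplicate_pct}%)")
```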

Foreign key integrity:

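-- NOT EXISTS instead of NOT IN: NOT IN returns no rows at all
-- if the subquery ever yields a NULL id.
SELECT o.account_id
FROM opportunities o
WHERE o.account_id IS NOT NULL
  AND NOT EXISTS (
    SELECT 1 FROM accounts a WHERE a.id = o.account_id
  );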

Range validation:

SELECT id, annual_revenue
FROM companies
WHERE annual_revenue < 0 OR annual_revenue > 1e12;

Regex format check (Postgres):

SELECT id, phone
FROM contacts
WHERE phone !~ '^\+?[0-9\-\(\) ]{7,20}$';
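
The Postgres query returns the offending rows; the "% matching regex" metric needs the share that passes. A rough Python equivalent, using the same pattern and a handful of invented sample values, could look like the following. How to treat missing values is a choice made for this sketch: they are left to the null-rate check and excluded here.

```python
import re
from typing import Iterable, Optional

# Same pattern as the Postgres check: optional "+", then 7-20 digits,
# dashes, parentheses, or spaces.
PHONE_RE = re.compile(r"^\+?[0-9\-() ]{7,20}$")


def valid_pct(values: Iterable[Optional[str]], pattern=PHONE_RE) -> Optional[float]:
    """Percentage of non-null values that fully match `pattern`.

    Returns None when there is nothing to validate.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None
    valid = sum(1 for v in present if pattern.fullmatch(v))
    return round(100.0 * valid / len(present), 2)


phones = ["+1 (555) 010-0199", "555-0100", "TBD", None, "12345"]
phone_valid_pct = valid_pct(phones)
print(phone_valid_pct)
```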

For deeper profiling - min/max/avg/stdev, distinct counts, string length distributions - Dataedo's query library is a solid reference. If you're in Python, Pandas and Polars equivalents exist for every one of these checks.
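
The threshold table also lists freshness and reconciliation, which the queries above don't cover. Here is one possible sketch of both in Python. Everything specific in it is an assumption made for the demo: the `source_contacts` and `target_contacts` tables, the `updated_at` column stored as ISO-8601 text, the fixed "now", and the in-memory SQLite database standing in for a real warehouse.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_contacts (id INTEGER PRIMARY KEY, updated_at TEXT);
    CREATE TABLE target_contacts (id INTEGER PRIMARY KEY, updated_at TEXT);
""")

# Fixed "now" so the demo is reproducible; use datetime.now(timezone.utc) in practice.
now = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
last_sync = (now - timedelta(hours=47)).isoformat()
conn.executemany("INSERT INTO source_contacts VALUES (?, ?)",
                 [(i, last_sync) for i in range(1, 501)])
# Three source rows never arrived in the target.
conn.executemany("INSERT INTO target_contacts VALUES (?, ?)",
                 [(i, last_sync) for i in range(1, 498)])


def hours_since_last_update(conn, table, now):
    """Freshness: hours between `now` and the newest updated_at in `table`.

    `table` is interpolated into SQL, so pass only trusted identifiers.
    """
    (latest,) = conn.execute(f"SELECT MAX(updated_at) FROM {table}").fetchone()
    return (now - datetime.fromisoformat(latest)).total_seconds() / 3600


def match_pct(conn):
    """Reconciliation: % of source ids that also exist in the target."""
    total, matched = conn.execute("""
        SELECT COUNT(*), COUNT(t.id)
        FROM source_contacts s
        LEFT JOIN target_contacts t ON t.id = s.id
    """).fetchone()
    return round(100.0 * matched / total, 2)


freshness_hours = hours_since_last_update(conn, "target_contacts", now)
reconciliation_pct = match_pct(conn)
print(freshness_hours, reconciliation_pct)
```

With this sample data the freshness check fails the 24-hour threshold and the reconciliation check passes the 99% threshold, matching the example rows in the table.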

Document Findings

Every audit needs a findings table. Here's what a filled example looks like:

| Check | Metric | Threshold | Result | Status |
|---|---|---|---|---|
| Email null rate | % null | < 5% | 2.3% | PASS |
| Contact duplicates | % duplicate | < 1% | 3.7% | FAIL |
| Revenue reconciliation | Source match % | > 99% | 99.4% | PASS |
| Revenue accuracy | Variance vs Stripe | < 0.1% | 0.02% | PASS |
| Data freshness | Hours since sync | < 24 hrs | 47 hrs | FAIL |
| Phone format | % valid format | > 98% | 96.1% | FAIL |

Track three operational KPIs over time to measure whether your audit program is actually improving things: number of incidents (N), time to detection (TTD), and time to resolution (TTR). If N goes down and both TTD and TTR shrink quarter over quarter, your audits are working.
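
A small sketch of how N, TTD, and TTR might be computed from an incident log follows. The `Incident` record and its three timestamps are hypothetical, and the definitions are assumptions of this example: TTD is measured from when the bad data appeared to when it was noticed, and TTR from detection to verified fix.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    occurred: datetime  # when the bad data entered the system
    detected: datetime  # when a person or a check noticed it
    resolved: datetime  # when the fix was verified


def kpis(incidents):
    """Return (N, mean TTD in hours, mean TTR in hours) for one period."""
    if not incidents:
        return 0, 0.0, 0.0
    ttd = mean((i.detected - i.occurred).total_seconds() / 3600 for i in incidents)
    ttr = mean((i.resolved - i.detected).total_seconds() / 3600 for i in incidents)
    return len(incidents), ttd, ttr


# Made-up incidents for one quarter.
q1 = [
    Incident(datetime(2026, 1, 5, 9), datetime(2026, 1, 6, 9), datetime(2026, 1, 8, 9)),
    Incident(datetime(2026, 2, 1, 0), datetime(2026, 2, 1, 12), datetime(2026, 2, 2, 12)),
]
n, ttd_hours, ttr_hours = kpis(q1)
print(n, ttd_hours, ttr_hours)
```

Computing the same three numbers per quarter gives the trend line leadership actually cares about.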

How to Audit CRM and GTM Data

CRM and go-to-market data has its own failure modes that generic frameworks miss entirely.

[Figure: CRM data quality failure modes with impact severity]

Bounce rates are the canary in the coal mine. If your SDRs are running sequences with a 12% email bounce rate, you're not just losing replies - you're burning domain reputation. Every bounced email tells ESPs your list is dirty, and once your domain gets flagged, even your good emails stop landing.

Stale contacts rot faster than most teams realize. If your CRM hasn't been enriched recently, a meaningful chunk of your contact data points to the wrong company, wrong title, or a deactivated inbox. The consensus on r/sales is that B2B contact data decays at roughly 30% per year, and in our experience that's conservative for fast-moving industries like tech.
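
To get a feel for what that decay figure means for a CRM that hasn't been enriched in a while, here is a back-of-the-envelope sketch. It takes the roughly 30% per year estimate at face value and assumes decay compounds at a constant rate, which is a simplification made for this example rather than something the estimate itself states.

```python
def stale_fraction(months_since_verified: float, annual_decay: float = 0.30) -> float:
    """Estimated share of contacts gone stale, assuming constant compounding decay."""
    return 1 - (1 - annual_decay) ** (months_since_verified / 12)


for months in (3, 6, 12, 24):
    print(f"{months:>2} months since verification: "
          f"~{stale_fraction(months) * 100:.0f}% stale")
```

Under these assumptions, about 16% of records are already stale six months after their last verification, and about half after two years.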

Duplicate opportunities inflating pipeline is the scenario from the intro, where 312 of 847 open opportunities were duplicates. Merge rules that don't catch company name variants ("Acme Inc" vs "Acme, Inc." vs "ACME") create phantom pipeline that misleads the entire organization.
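
One simple way to surface those name variants during an audit is to normalize company names before grouping. The sketch below is illustrative only: the suffix list and both helper functions are assumptions, and a production merge rule would likely also compare domains or use fuzzy matching.

```python
import re
from collections import defaultdict

# Legal suffixes to ignore when matching; extend for your own data.
_SUFFIXES = {"inc", "incorporated", "llc", "ltd", "limited", "corp", "corporation", "co"}


def normalize_company(name: str) -> str:
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_variants(names):
    """Map each normalized key to the raw names that collapse onto it."""
    groups = defaultdict(list)
    for raw in names:
        groups[normalize_company(raw)].append(raw)
    # Keep only keys with more than one raw spelling: the merge candidates.
    return {key: raws for key, raws in groups.items() if len(raws) > 1}


candidates = group_variants(["Acme Inc", "Acme, Inc.", "ACME", "Globex Corp"])
print(candidates)
```

The output is a review list, not an automatic merge: a human should still confirm that the grouped records really are the same account.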

Misrouted leads and broken attribution follow from stale territory fields and wrong industry codes. Leads route to the wrong rep, marketing can't measure what's working, and everyone blames each other.

When the audit reveals stale emails and dead phone numbers - and it will - the fix is re-verification and enrichment. Prospeo handles this at scale with 98% email accuracy, a 7-day data refresh cycle versus the 6-week industry average, and CRM enrichment that returns 50+ data points per contact. That's the difference between knowing your data is bad and actually fixing it.

Build Your Remediation Plan

An audit without a remediation plan is just audit theater. A PDF that makes everyone feel productive while nothing changes.

[Figure: Remediation plan template with required fields and structure]

Your remediation plan needs these fields for every failing check:

| Field | Example |
|---|---|
| Issue | Contact duplicate rate 3.7% (threshold: <1%) |
| Root cause | No dedup rule on web form imports |
| Downstream impact | Inflated pipeline by ~$2.1M |
| Action | Implement fuzzy match dedup on import |
| Owner | Sarah Chen, RevOps |
| Deadline | 2026-03-15 |
| Success criteria | Duplicate rate < 1% for 30 days |
| Verification | Re-run duplicate query on Apr 15 |
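
If the plan lives in code or a tracker rather than a spreadsheet, the same fields can be enforced structurally. The following is a minimal sketch with a hypothetical `RemediationItem` class; the only rule it encodes is the one stated in this article, that no failing check goes out without every field filled in.

```python
from dataclasses import dataclass, fields, replace
from datetime import date


@dataclass
class RemediationItem:
    issue: str
    root_cause: str
    downstream_impact: str
    action: str
    owner: str
    deadline: date
    success_criteria: str
    verification: str

    def missing_fields(self):
        """Names of fields left empty; an item with any is not ready to assign."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


item = RemediationItem(
    issue="Contact duplicate rate 3.7% (threshold: <1%)",
    root_cause="No dedup rule on web form imports",
    downstream_impact="Inflated pipeline by ~$2.1M",
    action="Implement fuzzy match dedup on import",
    owner="Sarah Chen, RevOps",
    deadline=date(2026, 3, 15),
    success_criteria="Duplicate rate < 1% for 30 days",
    verification="Re-run duplicate query on Apr 15",
)
draft = replace(item, owner="")  # same item, but nobody assigned yet

print(item.missing_fields())   # complete item
print(draft.missing_fields())  # flags the missing owner
```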

The structure of the final report matters too: executive summary, scope and sources, dimensions with metrics and thresholds, findings with root causes, and recommendations with an action plan that has named owners and deadlines. Skip any of those elements and the report collects dust.

Automating Audits in Your Pipeline

Manual audits are necessary for the first pass. But if you're still running SQL queries by hand every quarter, you're leaving gaps between audits where data rots undetected.

[Figure: Automated data quality audit pipeline architecture diagram]

dbt-expectations

dbt's native tests cover the basics - unique, not_null, accepted_values, relationships. That's four checks. You need more.

dbt-expectations is a free, open-source package that extends dbt with Great Expectations-style assertions you define in YAML. Thousands of teams use it in production.

Freshness check:

- dbt_expectations.expect_grouped_row_values_to_have_recent_data:
    group_by: [source_system]
    timestamp_column: updated_at
    datepart: hour
    interval: 24

Completeness across time buckets:

- dbt_expectations.expect_row_values_to_have_data_for_every_n_datepart:
    date_col: created_date
    date_part: day
    interval: 1
    test_start_date: "2026-01-01"
    test_end_date: "2026-03-01"

These run on every dbt build, which means your audit checks execute every time your pipeline runs. Issues surface in hours, not quarters.

Great Expectations + Airflow

For teams not on dbt, Great Expectations is the standard alternative. Install with pip install great_expectations, initialize with gx.get_context(mode="file") to create your project structure, then define expectation suites as JSON - each suite is a collection of checks against a specific dataset. Integrate with Airflow via airflow-provider-great-expectations (requires Python 3.10+, GX 1.7.0+, Airflow 2.1+). Suites run as Airflow tasks, failing the DAG when checks fail.

This is the right path for teams running Spark, Airflow, or custom Python pipelines where dbt isn't in the stack.

Audit Tools Compared

Let's be honest: most teams don't need a dedicated data quality tool. Start with SQL and dbt tests. Prove the process manually. Then decide if you need software.

That said, here's the field when you outgrow the basics:

| Tool | Type | AI/ML | Best For | Est. Pricing |
|---|---|---|---|---|
| Great Expectations | OSS / Code | No | Python/Airflow teams | Free; Cloud ~$1-5K/mo |
| dbt-expectations | OSS / Code | No | dbt shops | Free; dbt Cloud $100+/mo |
| Prospeo | Proprietary / No-Code | No | B2B contact verification & enrichment | Free tier; ~$0.01/email |
| Soda Core | OSS / Partial No-Code | No | YAML-first checks | Free; Cloud ~$500/mo |
| Deequ | OSS / Code | No | Spark-based pipelines | Free (AWS/Spark) |
| Datafold | Proprietary / Partial | No | Diff testing in CI | ~$500/mo+ |
| Monte Carlo | Proprietary / No-Code | Yes | Enterprise (100+ tables) | ~$50-150K+/yr |
| Anomalo | Proprietary / No-Code | Yes | Auto-detect anomalies at scale | ~$30-100K+/yr |
| Bigeye | Proprietary / No-Code | Yes | Mid-market monitoring | ~$30-80K+/yr |
| Collibra | Proprietary / No-Code | Yes | Governance + quality combined | ~$50-200K+/yr |

General observability tools like Monte Carlo and Anomalo are excellent at catching schema changes, volume anomalies, and distribution drift across your warehouse. But when your audit reveals that contact data is the problem - stale emails, invalid phone numbers, missing firmographics - those tools can't fix it. That's a different kind of quality issue, and it needs a different kind of tool.

Skip the enterprise platforms if your average deal size is under $50K and your warehouse has fewer than 50 tables. SQL plus dbt-expectations will cover you until you genuinely outgrow them.

Pitfalls That Kill Audit Programs

We've seen five patterns destroy audit programs before they produce a single remediation win.

Scope creep is the most common killer. Auditing everything at once means finishing nothing. One dataset, end-to-end, with a remediation win - then expand.

No ownership turns findings into suggestions. And suggestions get ignored. Every failing check needs a person and a deadline in the remediation plan. No exceptions.

Alert fatigue happens when teams set thresholds too tight, get 200 alerts a week, and start ignoring all of them. Set thresholds at levels that represent actual business risk, not theoretical perfection. If your null rate threshold is 0.1% on a non-critical field, you're going to drown in noise.

Audit theater is the quarterly report that gets produced, presented, filed, and forgotten. If your audit doesn't change a process, a pipeline, or a dataset within 30 days of completion, it's theater. Once your process is repeatable, formalize maturity levels - Initial, Repeatable, Optimized - so leadership can track progression and hold teams accountable.

Tool over-investment is buying a $100K observability platform before you've run your first SQL-based audit. That's like buying a race car before you have a driver's license. Start with SQL. Graduate to dbt tests. Buy tooling when you've outgrown the basics.

FAQ

How often should you run a data quality audit?

Run a full manual audit quarterly on your most critical datasets. Between audits, automate daily checks for nulls, duplicates, and freshness using dbt-expectations or Great Expectations. The manual audit catches structural issues and new failure modes; the automated checks catch day-to-day drift.

What's the difference between an audit and data profiling?

Data profiling is the discovery step - understanding column distributions, null rates, and value ranges. An audit includes profiling but adds thresholds, pass/fail judgments, root cause analysis, and a remediation plan with named owners. Profiling tells you what your data looks like. An audit tells you whether it's good enough and what to fix.

How do you audit B2B contact data in a CRM?

Check email bounce rates, phone number validity, duplicate contact records, and freshness - when was each record last verified? Also examine firmographic accuracy and job title currency, since those fields drive lead scoring and routing. For records that fail, re-verify through a platform with a short refresh cycle so you're replacing stale data with genuinely current records, not slightly-less-stale ones.

What tools do I need for my first audit?

A SQL client and a spreadsheet. Run profiling and validation queries against your database, document pass/fail results with thresholds, and build a remediation plan with owners and deadlines. Add dbt-expectations or Great Expectations when you're ready to automate. Don't buy enterprise tooling until you've proven the process works manually.

Who should own the audit framework in a GTM org?

RevOps is the natural owner for go-to-market data audits because they sit at the intersection of sales, marketing, and customer success systems. They have the cross-functional visibility to define thresholds that reflect actual business impact and the authority to enforce remediation deadlines across teams.
