Data Pipeline Observability: Catching Silent Failures Before Your Stakeholders Do
Your pipeline succeeded. Your dashboard is wrong. Here is how to build an observability layer that catches data quality issues before they reach production.
The most dangerous data pipeline failure is one that does not look like a failure.
The Airflow DAG shows green. The Spark job completed in 12 minutes. The data landed in the table. Everyone goes home. The next morning, the CFO opens the revenue dashboard and the numbers are 40% lower than expected. A frantic Slack thread begins.
What happened? An upstream system changed a column name. The join silently returned zero matches instead of erroring. The pipeline wrote empty results to the serving table. Every downstream check passed because, technically, the pipeline succeeded.
At CData Consulting, we have seen this pattern at nearly every client we work with. The fix is not more tests — it is a fundamentally different approach to observability that treats data as a first-class citizen alongside infrastructure.
The Three Layers of Pipeline Observability
Most teams only monitor Layer 1. The teams that sleep through the night monitor all three.

Layer 1: Infrastructure Observability — CPU, memory, disk, pod health, Spark executor status, Kafka consumer lag. This tells you the machine is running.

Layer 2: Pipeline Observability — execution time, rows processed, data volume in vs out, join hit rates. This tells you the job worked.

Layer 3: Data Quality Observability — row counts, distributions, freshness, schema drift, business rule validation. This tells you the data is right.
Without Layer 3, you can have green dashboards and wrong numbers.
Quality Gates: Stop Bad Data at the Source
We build quality checks directly into the pipeline as gates — the pipeline halts if a check fails, preventing bad data from reaching production.
Gate 1: Schema Validation at Ingestion. Validate that incoming data matches expected schema. This catches upstream column renames, type changes, and missing fields. Compare actual column names and types against expected, raise an error on mismatches, and log warnings for new unexpected columns (which are worth knowing about but should not block the pipeline).
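A minimal sketch of such a gate, assuming the incoming schema is available as a column-name-to-type mapping (the column names here, like `event_id` and `revenue`, are illustrative, not from a real client pipeline):

```python
EXPECTED_SCHEMA = {"event_id": "string", "event_ts": "timestamp", "revenue": "double"}

def validate_schema(actual_schema: dict) -> list:
    """Raise on hard mismatches; return warnings for new, unexpected columns."""
    missing = set(EXPECTED_SCHEMA) - set(actual_schema)
    if missing:
        # A renamed or dropped upstream column shows up here — halt the pipeline.
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    for col, expected_type in EXPECTED_SCHEMA.items():
        if actual_schema[col] != expected_type:
            raise ValueError(
                f"Type mismatch on {col}: expected {expected_type}, got {actual_schema[col]}"
            )
    # New columns are worth knowing about but should not block the pipeline.
    return [f"Unexpected new column: {col}"
            for col in sorted(set(actual_schema) - set(EXPECTED_SCHEMA))]
```

In practice the actual schema would come from the source system or file reader; the gate itself stays a pure comparison, which keeps it trivial to unit test.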
Gate 2: Volume Anomaly Detection. Compare today's row count against the trailing 7-day average. If the count drops more than 30%, halt the pipeline and alert. If the count is zero, halt unconditionally — an empty write is never legitimate.
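The volume gate reduces to a few lines once the counts are in hand (the 30% threshold below matches the text; everything else is a sketch):

```python
def check_volume(today_count: int, trailing_counts: list, max_drop: float = 0.30) -> None:
    """Raise if today's row count is zero or anomalously low vs the trailing average."""
    if today_count == 0:
        # Zero-row output is always an error, never a success.
        raise ValueError("Zero rows written — failing the pipeline")
    baseline = sum(trailing_counts) / len(trailing_counts)
    drop = (baseline - today_count) / baseline
    if drop > max_drop:
        raise ValueError(
            f"Row count dropped {drop:.0%} vs trailing average ({baseline:,.0f} rows)"
        )
```

Note the threshold is relative, not absolute — the same gate keeps working as the table naturally grows or shrinks.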
Gate 3: Freshness Monitoring. Verify that the table has been updated within the SLA window. With Iceberg, you can check snapshot metadata directly — no need to scan data. If the table is staler than the defined threshold (for example, 2 hours), raise a freshness alert.
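A hedged sketch of the freshness check: in a real deployment `last_snapshot_ts` would be read from Iceberg snapshot metadata (the current snapshot's commit timestamp), which is why no data scan is needed; here it is simply a parameter:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_snapshot_ts: datetime,
                    max_staleness: timedelta = timedelta(hours=2),
                    now: datetime = None) -> None:
    """Raise if the table's latest snapshot is older than the SLA window."""
    now = now or datetime.now(timezone.utc)
    staleness = now - last_snapshot_ts
    if staleness > max_staleness:
        raise ValueError(
            f"Table is stale: last update {staleness} ago exceeds SLA of {max_staleness}"
        )
```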
Gate 4: Distribution Drift Detection. Check that key column distributions have not shifted dramatically. For example, if the US segment suddenly disappears from the data or drops below a minimum expected row count, flag it. This catches issues like a country filter being accidentally applied upstream.
The Observable Pipeline: Putting It All Together
All four gates integrate into a single Airflow DAG: Ingest (land raw data) → Gate 1 (validate schema — fail: halt + alert) → Transform (clean, enrich, aggregate) → Gate 2 (validate volume — fail: halt + alert) → Load (write to Iceberg) → Gates 3 & 4 (validate freshness + distribution — fail: halt + rollback) → Publish Metrics (Grafana dashboards).
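As a simplified stand-in for that DAG, the orchestration logic is just "run stages in order, halt on the first failing gate." In Airflow each entry would be a task and the halt a failed task firing an alert; the stage names below mirror the DAG and the callables are placeholders:

```python
def run_pipeline(stages: list) -> str:
    """stages: ordered (name, callable) pairs; each callable raises on failure."""
    for name, stage in stages:
        try:
            stage()
        except Exception as exc:
            # In Airflow this is a failed task: halt downstream tasks and alert.
            return f"halted at {name}: {exc}"
    return "published"
```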
The critical detail: when a post-load validation fails, we use Iceberg's time travel to roll back to the last known good snapshot. The serving table is never left in a bad state. Consumers either see fresh-and-correct data, or yesterday's data with a freshness alert — never wrong data.
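In a Spark + Iceberg setup, the rollback itself is a one-line call to Iceberg's `rollback_to_snapshot` stored procedure. A sketch that builds that call (the catalog and table names are illustrative; `spark` would be your live SparkSession):

```python
def rollback_sql(catalog: str, table: str, snapshot_id: int) -> str:
    """Build the Iceberg rollback procedure call issued when a post-load gate fails."""
    return f"CALL {catalog}.system.rollback_to_snapshot('{table}', {snapshot_id})"

# In the pipeline this would run as, e.g.:
# spark.sql(rollback_sql("prod", "serving.fact_ad_events", last_good_snapshot_id))
```

The last good snapshot ID is recorded before the load step, so the failure handler always knows exactly where to roll back to.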
The Metrics Dashboard
Every CData pipeline deployment includes a Grafana dashboard tracking core metrics: data freshness per table with SLA status, volume trends over the trailing 7 days, pipeline latency broken down by stage (ingest, transform, load, validate), and quality check status (schema, volume, nulls, uniqueness, distribution, freshness) with pass/fail history.
Five Rules for Data Observability
1. Zero-row output is always an error. A pipeline that writes zero rows should fail, not succeed silently. This one rule alone would prevent the majority of data incidents we have seen.
2. Monitor data, not just infrastructure. Your Kubernetes cluster can be perfectly healthy while your pipeline writes garbage to the serving table. Layer 3 observability is non-negotiable.
3. Use relative thresholds, not absolute. Do not alert on "fewer than 1 million rows." Alert on "30% fewer rows than the trailing 7-day average." Absolute thresholds break whenever your data naturally grows or shrinks.
4. Build rollback into the architecture. Iceberg's snapshot isolation makes this trivial. When a quality gate fails post-load, roll back to the previous snapshot. Your consumers see stale-but-correct data while you investigate. This is infinitely better than wrong data in production.
5. Alert the right person with the right context. "Pipeline failed" is useless. "Table fact_ad_events has 0 rows for 2026-03-01 after transform. Expected ~1.2M based on 7-day average. Last successful run: 23 hours ago. Possible cause: upstream schema change in raw.ad_events (new column device_category detected)." That is actionable.
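An actionable alert like the one in rule 5 is easy to template once the gate metrics exist. A sketch (field names and the message shape are assumptions for illustration):

```python
def build_alert(table: str, partition: str, actual_rows: int, expected_rows: int,
                last_success: str, probable_cause: str) -> str:
    """Format a quality-gate alert with enough context to act on immediately."""
    return (
        f"Table {table} has {actual_rows} rows for {partition} after transform. "
        f"Expected ~{expected_rows / 1e6:.1f}M based on 7-day average. "
        f"Last successful run: {last_success}. "
        f"Possible cause: {probable_cause}."
    )
```

Everything in the message comes from data the gates already computed — the baseline, the last successful run, and any schema-validation warnings — so the alert costs nothing extra to produce.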
The Cost of Not Doing This
We have worked with teams that discovered data quality issues through their stakeholders — the CFO asking "why are revenue numbers wrong?" or a campaign manager noticing "the dashboard shows zero clicks for the US." The cost is not just the bug fix. It is the trust deficit. Once stakeholders lose confidence in the data, they start maintaining their own spreadsheets, double-checking every number, and building shadow analytics. Rebuilding that trust takes months.
Investing in observability upfront is orders of magnitude cheaper than rebuilding trust after a data incident.
At CData Consulting, we build observable data pipelines from day one — not as an afterthought. If your data team is fighting silent failures and trust issues, let's talk about building the right observability layer for your stack.
Frequently Asked Questions
What is a silent pipeline failure?
A silent failure is when a data pipeline completes successfully (no errors, green status in Airflow) but produces incorrect or incomplete data. Common causes include upstream schema changes that cause joins to return zero matches, empty source files that result in zero-row outputs, and data type changes that cause silent truncation or casting errors.
What are data quality gates?
Quality gates are validation checks embedded directly into the pipeline that halt execution if they fail. Unlike post-hoc testing, gates prevent bad data from ever reaching the serving layer. Common gates include schema validation, volume anomaly detection, freshness checks, and distribution drift detection.
How does Iceberg time travel help with data quality?
When a post-load quality gate fails, you can use Iceberg rollback to revert the table to its previous good snapshot. This means the serving table is never left in a bad state — consumers see stale-but-correct data while the team investigates. Without time travel, a bad write can corrupt the production table with no easy way to undo it.
What tools do you recommend for data observability?
We use Prometheus and Grafana for metrics and dashboards, PagerDuty or Slack for alerting, and OpenLineage with Marquez for data lineage. For data quality checks, we build custom gates in Python rather than relying on a single vendor tool — this gives more control and avoids another SaaS dependency in the critical path.
Need help building your data platform?
At CData Consulting, we design, build, and operate modern data infrastructure for companies across North America. Whether you are planning a migration, optimizing costs, or building from scratch — let's talk.