Real-Time vs Batch Pipelines: Architecture Patterns for Data Teams
DATA ENGINEERING

When should you stream and when should you batch? A practical comparison of both architectures, with code examples and a hybrid pattern that gives you both.

March 1, 2026 · 9 min read

Every data team faces the same question: should we process this data in real-time or in batch? The answer is almost always "both" — but knowing which workloads go where is the difference between a system that scales and one that collapses under its own complexity.

At CData Consulting, we see this question on nearly every engagement. This post breaks down both architectures, shows when to use each, and presents the hybrid pattern we deploy for clients who need both speed and accuracy.

The Core Tradeoff

Real-time gives you speed but costs more in complexity, infrastructure, and debugging difficulty. Use cases include live dashboards, fraud detection, and real-time bidding (sub-second to sub-minute latency). Batch is simpler and cheaper but your data is always stale by at least one processing interval. Use cases include daily reports, ML training data, and billing reconciliation (hourly to daily latency). The right answer depends on one question: what is the cost of stale data for this use case?

Architecture 1: Batch Pipeline

The workhorse of data engineering. Data flows from source systems to a landing zone on S3 (raw files in Parquet or CSV), then through a Spark job for cleaning, enrichment, and aggregation, and finally to the serving layer (Iceberg tables queryable via Athena, Snowflake, or dashboards). The entire flow is orchestrated by an Airflow DAG — scheduled daily or hourly.

A typical Airflow DAG validates that source data exists, runs a Spark transformation with Iceberg catalog support and dynamic allocation, updates the Glue Catalog via a crawler, and performs data quality checks against expected row counts and drop thresholds.
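The data-quality step at the end of that DAG can be sketched as a plain Python callable of the kind Airflow's `PythonOperator` would invoke. The function name, counts, and thresholds below are illustrative, not taken from a real engagement:

```python
# Hypothetical data-quality check for the final step of the batch DAG.
# Fails the task (and the DAG run) if today's row count dropped too far
# below expectations, which usually signals an upstream extraction problem.

def check_row_counts(actual_rows: int, expected_rows: int,
                     max_drop_pct: float = 5.0) -> None:
    """Raise if the row count dropped more than max_drop_pct vs expected."""
    if expected_rows <= 0:
        raise ValueError("expected_rows must be positive")
    drop_pct = max(0.0, (expected_rows - actual_rows) / expected_rows * 100)
    if drop_pct > max_drop_pct:
        raise ValueError(
            f"Row count dropped {drop_pct:.1f}% vs expected "
            f"({actual_rows} of {expected_rows}); threshold is {max_drop_pct}%"
        )

# A 2% drop passes silently; a 40% drop raises and fails the task.
check_row_counts(actual_rows=980_000, expected_rows=1_000_000)
```

Raising an exception is the idiomatic way to fail an Airflow task, which is why the check returns nothing on success.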

When batch works best: reporting and analytics (dashboards refreshed daily or hourly), ML model training (models retrained on yesterday's data), billing and reconciliation (accuracy matters more than speed), backfills (reprocessing historical data after a schema change), and complex joins (joining 10 tables is straightforward in batch, painful in streaming).

Architecture 2: Real-Time Streaming Pipeline

Events flow continuously from sources (clickstream, IoT sensors, ad events, webhooks) into Kafka topics — a durable, partitioned, replayable log. Spark Structured Streaming reads from Kafka, parses JSON events, and performs windowed aggregations with watermarks to handle late-arriving data. Results are written to two sinks: a real-time serving layer (ClickHouse or Redis for dashboards) and S3 for batch reconciliation.
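The windowing-plus-watermark behavior is the part that trips people up, so here is a toy model in plain Python (not Spark) of what tumbling windows with a watermark do. Window and watermark sizes are illustrative; in Spark this is `withWatermark(...)` plus `groupBy(window(...))`:

```python
from collections import defaultdict

# Toy model of 5-minute tumbling windows with a 2-minute watermark.
# Events arriving more than `watermark_s` behind the max event time seen
# so far are dropped, because the window state they belong to has already
# been finalized and evicted.

WINDOW_S = 300       # 5-minute tumbling windows
WATERMARK_S = 120    # tolerate events up to 2 minutes late

def aggregate(events, window_s=WINDOW_S, watermark_s=WATERMARK_S):
    """events: iterable of (event_time_s, key) in arrival order.
    Returns ({(window_start, key): count}, dropped_count)."""
    counts = defaultdict(int)
    max_event_time = 0
    dropped = 0
    for ts, key in events:
        max_event_time = max(max_event_time, ts)
        if ts < max_event_time - watermark_s:
            dropped += 1  # too late: this window's state is already gone
            continue
        window_start = ts - ts % window_s
        counts[(window_start, key)] += 1
    return dict(counts), dropped

# The last event (ts=30) arrives after an event at ts=320, so it is
# more than 120s late and gets dropped.
events = [(10, "click"), (40, "click"), (320, "click"), (30, "click")]
counts, dropped = aggregate(events)
```

This is exactly why streaming aggregates are approximate: the dropped event is gone from the hot path, but it still lands on S3 for the batch job to count.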

When streaming works best: real-time dashboards (live campaign performance, active user counts), fraud and anomaly detection (catch suspicious patterns before damage is done), real-time bidding (ad-tech needs sub-second decisions), operational monitoring (system health, SLA tracking), and event-driven triggers (send an alert when a campaign budget is exhausted).

Architecture 3: The Hybrid (What We Actually Deploy)

In practice, most organizations need both. The hybrid architecture uses Kafka as the central nervous system, with streaming and batch paths consuming the same events. The hot path runs Spark Structured Streaming with 5-minute windowed aggregations, writing to Redis (hot cache) and ClickHouse (real-time OLAP). The cold path uses Kafka Connect to land events on S3, then daily Spark jobs perform full recomputation with complex joins, writing to Iceberg tables queryable via Athena and BI tools.

Kafka is the single source of truth. Both paths consume from the same topics. The streaming path gives you speed (seconds). The batch path gives you accuracy (complete joins, reconciled numbers). When they disagree, batch wins — it is the system of record.

Practical example at an ad-tech company: The hot path reports "Campaign X has 12,847 impressions in the last hour, CTR is 2.3%, spend is $4,521" — powering the live campaign dashboard, updated every 30 seconds, approximately correct. The cold path reports "Campaign X had 12,892 impressions yesterday, CTR was 2.31%, spend was $4,538.42" — powering the daily report and billing, updated at 6 AM, exactly correct. The dashboard shows the hot path during the day. The next morning, batch numbers replace streaming approximations.
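The morning replacement step can be sketched as a small reconciliation function, using the campaign numbers above. The function and tolerance are hypothetical; the point is that the cold-path value is always what gets carried forward:

```python
def reconcile(hot: dict, cold: dict, tolerance_pct: float = 2.0) -> dict:
    """Compare streaming (hot) vs batch (cold) metrics. Batch is the
    system of record, so the final value is always the cold one; drift
    outside tolerance is flagged for investigation, not 'fixed'."""
    report = {}
    for metric, cold_value in cold.items():
        hot_value = hot.get(metric, 0)
        drift_pct = abs(hot_value - cold_value) / cold_value * 100
        report[metric] = {
            "final": cold_value,  # batch wins on disagreement
            "drift_pct": round(drift_pct, 2),
            "within_tolerance": drift_pct <= tolerance_pct,
        }
    return report

# The hot/cold numbers from the campaign example above.
hot = {"impressions": 12_847, "spend": 4_521.00}
cold = {"impressions": 12_892, "spend": 4_538.42}
report = reconcile(hot, cold)
```

Here the streaming path drifted well under 1% on both metrics, which is typical; persistent large drift usually means the watermark is too aggressive or a consumer is lagging.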

Decision Framework

To decide which path a workload belongs on, ask: Do you need it in under 1 minute? If yes, and approximate data is OK for now, stream only (live dashboards, monitoring). If yes but you need exact numbers later, stream + fix in batch (billing, reporting). If no, and the logic is simple, use micro-batch at 15-minute intervals (alerts, simple aggregations). If no and the logic is complex, use full batch daily or hourly (reporting, ML, complex joins).
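The framework above is mechanical enough to encode directly. Flag and return-value names below are illustrative:

```python
def choose_path(needs_sub_minute: bool, needs_exact_later: bool,
                complex_logic: bool) -> str:
    """Encode the decision framework: latency need first, then whether
    exact numbers are required later, then logic complexity."""
    if needs_sub_minute:
        if needs_exact_later:
            return "stream + batch reconciliation"  # billing, reporting
        return "stream only"                        # live dashboards, monitoring
    if complex_logic:
        return "full batch (daily/hourly)"          # reporting, ML, complex joins
    return "micro-batch (15 min)"                   # alerts, simple aggregations

choose_path(True, False, False)   # live dashboard
choose_path(True, True, False)    # billing feed
choose_path(False, True, True)    # daily report with joins
```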

Common Pitfalls

1. Streaming everything because it sounds impressive. Streaming adds operational complexity — checkpointing, watermarks, out-of-order events, exactly-once semantics. If daily freshness is fine, batch is 10x simpler to build and debug.

2. Not archiving streaming data. Always land streaming events to S3 (the cold path). Without an archive, you cannot backfill, debug, or retrain ML models on historical data.

3. Trusting streaming aggregates for billing. Streaming numbers are approximate by nature (late events, processing delays). Always reconcile with a batch job for anything financial.

4. Ignoring late-arriving data. Events arrive out of order. A click from 11:59 PM might arrive at 12:02 AM. Your watermark strategy determines whether you include or drop it. For batch, this is trivial — you reprocess the full day. For streaming, you need explicit watermark policies.

5. Building two completely separate pipelines. The hybrid pattern works because both paths share the same Kafka source and the same schema. If your streaming and batch pipelines read from different sources with different schemas, you are maintaining two systems with inevitable drift.
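One lightweight way to enforce pitfall 5 is a single parse function that both paths import, so the schema lives in exactly one place. Field names below are illustrative; in production this role is usually played by a schema registry:

```python
from dataclasses import dataclass
import json

# One event schema shared by both consumer paths. Both the streaming job
# and the batch job import parse_event, so they cannot drift on schema.

@dataclass(frozen=True)
class AdEvent:
    campaign_id: str
    event_type: str   # e.g. "impression" or "click"
    ts_ms: int        # event time, epoch milliseconds
    cost_usd: float

def parse_event(raw: bytes) -> AdEvent:
    """Decode one Kafka message payload into the shared event type."""
    d = json.loads(raw)
    return AdEvent(d["campaign_id"], d["event_type"],
                   int(d["ts_ms"]), float(d["cost_usd"]))

raw = b'{"campaign_id": "c1", "event_type": "click", "ts_ms": 1700000000000, "cost_usd": 0.25}'
event = parse_event(raw)
```

When the schema changes, both pipelines pick it up from the same module in the same deploy, instead of drifting one field at a time.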

The Bottom Line

Do not choose between streaming and batch. Use Kafka as the backbone, stream what needs to be fast, batch what needs to be accurate, and reconcile where they overlap. This is the architecture pattern that scales from startup to enterprise — and it is what we deploy for clients who need both operational speed and analytical accuracy.

At CData Consulting, we design and build data pipelines that match your latency requirements — whether that is sub-second streaming or reliable daily batch. Let's talk about your architecture.

Frequently Asked Questions

When should I use streaming instead of batch?

Use streaming when the cost of stale data is high — live dashboards, fraud detection, real-time bidding, and operational monitoring. If daily or hourly freshness is acceptable (reporting, ML training, billing), batch is simpler, cheaper, and easier to debug.

What is the hybrid pipeline architecture?

The hybrid pattern uses Kafka as the central event log with two consumer paths: a hot path (Spark Structured Streaming for real-time aggregations) and a cold path (Kafka Connect to S3, then daily Spark batch jobs). Both paths consume the same events. The hot path gives speed, the cold path gives accuracy.

Why does batch win when streaming and batch numbers disagree?

Streaming aggregates are approximate — they may miss late-arriving events, have processing delays, or use windowed approximations. Batch jobs reprocess the complete dataset for a given period, including all late arrivals, and can perform complex multi-table joins. For anything financial or compliance-related, batch is the system of record.

How much more expensive is streaming compared to batch?

Streaming typically costs 3–5x more than batch for the same data volume. The cost comes from always-on compute (Spark Streaming clusters, Kafka brokers), operational complexity (monitoring checkpoints, handling failures, managing watermarks), and serving infrastructure (Redis, ClickHouse). The tradeoff is latency — seconds instead of hours.

Need help building your data platform?

At CData Consulting, we design, build, and operate modern data infrastructure for companies across North America. Whether you are planning a migration, optimizing costs, or building from scratch — let's talk.