From Redshift to Snowflake at a Streaming Platform: A Case Study in Workload Isolation
A first-person account of leading a 6-month Redshift to Snowflake migration at a major streaming media company — what broke, what worked, and the workload isolation pattern that made it stick.
I was brought in to lead the data platform migration at a major content streaming platform. The brief looked simple on paper — get them off Amazon Redshift onto Snowflake — but the reality was a six-month engagement that touched every part of their analytics stack: ingestion, modeling, BI, ML feature pipelines, and the cost model itself. This is the case study I should have written six months ago. The client name stays anonymous, but every architectural decision below is real.
Why they could not stay on Redshift
The streaming workload pattern was eating Redshift alive. Three things were happening at once.
Concurrency. A few thousand daily Looker users plus ML feature backfills plus ad-hoc analyst SQL all hit the same cluster. By 10 AM PT every weekday the BI dashboards were stalling because a feature pipeline was running. They had tried Concurrency Scaling and WLM queues, but the pain kept resurfacing.
Cost predictability. They were on reserved nodes for predictability, then over-provisioning for peaks (Sunday-night content drops, big sports events). Spend was 30 to 40 percent above the workload they actually needed during normal weeks.
Operational drag. Every schema change required a maintenance window. Every cluster resize was a project. The data team was spending more time on Redshift cluster gymnastics than on shipping models.
The decision: Snowflake — but with workload isolation as the real reason
The temptation in any migration is to lift-and-shift. Pick a target, copy the data, repoint the dashboards. We deliberately did not do that. The whole reason Snowflake was the right answer for this client was the multi-warehouse architecture. If we lifted-and-shifted into a single warehouse we would have rebuilt the same noisy-neighbor problem on a more expensive substrate.
The architecture we landed on:
Four virtual warehouses, sized independently, with auto-suspend tuned to the workload. ETL_WH (Medium) runs scheduled DAGs and goes to sleep between them. BI_WH (Large, multi-cluster) handles the Looker fleet. ML_FEATURE_WH (XL) runs feature backfills and never touches BI traffic. ADHOC_WH (Small, 30s suspend) is the cheap warehouse for analyst exploration. Every warehouse queries the same centralized storage — no copies, no sync, just isolated compute.
Migration approach: phased over six months
Months 1 to 2 — Foundations. Stood up Snowflake, IAM, network policies. Built the dbt project from scratch (their existing stored-procedure-heavy modeling was a millstone). Set up Snowpipe Streaming for the playback event firehose so we could parallel-ingest from day 1.
Months 3 to 4 — Parallel run. All ingestion writes to both Redshift and Snowflake. Every dbt model has a Redshift equivalent. We compare query results nightly with a row-level diff harness. This is what catches the SQL dialect surprises — Redshift's LISTAGG ordering, DATE_TRUNC behavior on partial weeks, the way Redshift handles NULL in window functions.
Month 5 — BI cutover. Looker first, because dashboards have the highest visibility (and the most political cost if they break). We migrated explores model-by-model rather than all at once. Rolled back twice on specific dashboards where the diff harness flagged drift we had not understood.
Month 6 — ML and decommission. Feature pipelines moved last because they tolerate downtime better than user-facing dashboards. Decommissioned Redshift the week after the last ML pipeline cut over.
What was actually hardest
Stored procedures. They had around 120 Redshift stored procedures that handled ETL logic. Most translated mechanically to dbt models. Maybe 15 of them encoded business logic that nobody could explain. We had to interview tenured analysts to figure out what some of them did before we could rewrite them.
Looker semantic-layer drift. The Looker LookML had absorbed years of metric definitions and silent overrides. Several "the same" metrics actually had different definitions in three different explores. The migration forced a metric-canonicalization exercise we should probably have charged extra for.
Recommendation feature drift. The recommendation team's features were derived from session-level aggregations that depended on the old Redshift session-stitching logic. Re-implementing that in dbt models exposed a long-standing bug in their session definition that we then had to roll out behind a feature flag, very carefully, because changing it would change live recommendations.
What we got right
Parallel-run validation from day 1. The diff harness — automated query-by-query comparison between Redshift and Snowflake results — was the single highest-leverage piece of the project. It caught about a dozen drift bugs we would otherwise have shipped to BI users.
Workload isolation by warehouse, not by query hint. The temptation in WLM-style systems is to encode workload class in queries (priority hints, queue routing). We pushed all of that out to the warehouse selection layer in dbt and the BI tools. The result is that finance can read the warehouse bill and see exactly which workload is costing what — and right-size each one independently.
Time-travel for A/B test reproducibility. Streaming companies live and die by A/B tests, and Snowflake's time-travel meant we could run a stale A/B analysis as of a specific point in time without juggling table snapshots. This was a free upgrade we did not specifically design for.
Key results
I am going to stay vague on the dollar figures because they are the client's, but directionally: monthly spend went from "wildly variable, with reserved-instance overcommit" to predictable within ±10 percent. Concurrency complaints from BI users dropped to near-zero within two weeks of cutover. ML feature backfills now run on a dedicated XL warehouse without affecting analyst queries. Schema changes are a pull request, not a maintenance window.
What I would do differently
Set up cost monitors before cutover, not after. The worst case in any consumption-based system is "we cut over and now we are spending 3x what we expected because someone forgot a WHERE clause." Resource monitors should be live on day 1 of the parallel-run period.
Force the metric-canonicalization conversation up front. We discovered three different definitions of "active subscriber" mid-migration. That is the kind of thing that takes weeks of stakeholder alignment, and ideally happens before you are trying to validate query results.
Bring in domain owners earlier. The ML team could have flagged the session-stitching bug if we had pulled them into the project in month 1 instead of month 4. We left the heroics to the migration team when the people who knew the data best were one Slack message away.
Bottom line
A Redshift-to-Snowflake migration is not just a swap of warehouses. The leverage is in re-architecting the workload isolation, fixing the metric definitions, and getting governance into the pipeline rather than into a wiki. If you are about to start one — or you are six weeks in and starting to suspect you bit off more than you scoped — let's talk.
Frequently Asked Questions
Why did the streaming company choose Snowflake over BigQuery or Databricks?
Three reasons in their priority order: multi-warehouse workload isolation matched their concurrency problem, the per-second consumption model fixed the over-provisioning issue, and Snowpipe Streaming let us cut over the high-volume playback event ingestion without rewriting the upstream Kafka producers. BigQuery and Databricks were both viable, but Snowflake mapped most cleanly to the specific pain.
How did you handle Redshift-specific SQL during migration?
Two layers. SQLGlot for mechanical translation of straightforward queries. Manual review for stored procedures, window function edge cases (Redshift treats NULL in OVER clauses differently from Snowflake), and any SQL that referenced Redshift system tables. We caught the rest with the parallel-run diff harness.
What was the biggest risk during cutover?
Query result drift on customer-facing dashboards. Even small differences in metric definitions can become very public very fast at a streaming platform — exec dashboards drive content investment decisions. We mitigated by cutting over Looker explore-by-explore, with parallel-run diffs running for at least a week before each cutover, and a one-click rollback for each explore.
How long did the workload isolation pattern take to pay off?
Concurrency complaints dropped to near-zero within two weeks of full cutover. The cost predictability win took longer — about two months of tuning auto-suspend timings and warehouse sizes per workload to land in the ±10 percent monthly range.
Need help building your data platform?
At CData Consulting, we design, build, and operate modern data infrastructure for companies across North America. Whether you are planning a migration, optimizing costs, or building from scratch — let's talk.
Get data engineering insights delivered to your inbox
Join our newsletter for weekly insights on Snowflake, data architecture, and modern analytics.