July 20, 2025

Lessons Learned: Migrating to Event-Driven Architecture

Practical insights from moving a batch-heavy data platform to an event-driven model — what worked, what didn't, and what we'd do differently.

Architecture · Event-Driven · Distributed Systems

Why We Made the Move

Our data platform started as a collection of batch jobs. Every hour, cron-triggered scripts would pull data, transform it, and load results. It worked well enough at small scale, but as the business grew, the cracks showed:

  • Latency expectations changed. Stakeholders wanted fresher data — minutes, not hours.
  • Coupling was everywhere. With every stage tightly integrated, upstream changes regularly broke downstream consumers.
  • Scaling was painful. Adding new data consumers meant modifying existing pipelines.

Event-driven architecture promised to solve these problems. Reality was more nuanced.

The Migration Approach

We didn't do a big-bang migration. Instead, we introduced event-driven patterns alongside existing batch processes, migrating workloads incrementally.

Phase 1: Event Bus Introduction

  • Deployed an event broker as the central nervous system
  • Existing batch jobs published events as a side effect (dual-write pattern initially)
  • New consumers subscribed to events instead of polling databases
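The dual-write step can be sketched roughly as follows. This is a minimal illustration, not our production code: `EventBus`, the `orders.updated` topic, and the dict-backed store are all stand-ins for the real broker client and database.

```python
class EventBus:
    """Stand-in for the real broker client (e.g. a Kafka producer)."""
    def __init__(self):
        self.events = []

    def publish(self, topic, event):
        self.events.append((topic, event))


def load_batch(rows, db, bus):
    """Existing batch load, extended with an event publish per write."""
    for row in rows:
        db[row["id"]] = row                      # 1. original batch write
        bus.publish("orders.updated", {          # 2. new: dual-write an event
            "order_id": row["id"],
            "status": row["status"],
        })


db, bus = {}, EventBus()
load_batch([{"id": 1, "status": "shipped"}], db, bus)
```

The fragility here is the point: the write and the publish are not atomic, which is why we treated dual-write as a bridge rather than a destination.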

Phase 2: Producer Migration

  • Rewrote critical data producers to be event-native
  • Eliminated the dual-write pattern where possible
  • Added schema registry for event contracts

Phase 3: Consumer Optimization

  • Built stream processing for real-time aggregations
  • Maintained batch fallbacks for complex analytical workloads
  • Implemented event sourcing for audit-critical data
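The event-sourcing piece of Phase 3 boils down to one idea: derive state by replaying an append-only log instead of mutating records in place. A toy sketch, with illustrative event types standing in for our real domain events:

```python
def apply(balance, event):
    """Fold a single event into the current state."""
    if event["type"] == "deposit":
        return balance + event["amount"]
    if event["type"] == "withdrawal":
        return balance - event["amount"]
    return balance           # unknown events are ignored, not rejected


def rebuild(log):
    """Reconstruct current state by replaying the full event log."""
    balance = 0
    for event in log:
        balance = apply(balance, event)
    return balance


log = [
    {"type": "deposit", "amount": 100},
    {"type": "withdrawal", "amount": 30},
]
assert rebuild(log) == 70    # the log itself is the audit trail
```

Because the log is never rewritten, audit questions become replay questions, which is exactly what made the pattern attractive for audit-critical data.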

What Worked

Decoupling Was Real and Valuable

The biggest win was organizational, not just technical. Teams could build new features and data products without coordinating with upstream producers. A new dashboard? Subscribe to the relevant event streams. A new ML feature? Same thing. No tickets, no meetings, no pipeline modifications.

Schema Registry Was Essential

Enforcing schemas on events prevented the "garbage in" problem. When a producer tried to publish a breaking change, the registry rejected it. This forced explicit versioning conversations instead of silent breakage.
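The rejection behavior can be illustrated with a deliberately simplified compatibility check. Real registries (Confluent Schema Registry, for example) apply much richer Avro/Protobuf/JSON Schema rules; this toy version only flags one common breaking change, dropping a required field:

```python
def drops_required_fields(old_schema, new_schema):
    """Toy breaking-change check: does the new schema drop a required field
    that existing consumers may depend on?"""
    return bool(set(old_schema["required"]) - set(new_schema["required"]))


def register(registry, subject, schema):
    """Reject schemas that break compatibility with the latest version."""
    latest = registry.get(subject)
    if latest and drops_required_fields(latest, schema):
        raise ValueError(f"incompatible schema for {subject}")
    registry[subject] = schema


registry = {}
register(registry, "orders", {"required": ["order_id", "status", "amount"]})
try:
    register(registry, "orders", {"required": ["order_id", "status"]})
except ValueError:
    pass  # breaking change rejected, forcing a versioning conversation
```

The mechanism matters less than the effect: a producer cannot silently publish a shape its consumers do not expect.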

What Didn't Work (At First)

Exactly-Once Semantics Are Hard

We underestimated the complexity of exactly-once processing. Idempotent consumers were the practical answer — designing every consumer to handle duplicate events gracefully rather than trying to guarantee single delivery.
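In practice the idempotent-consumer pattern is mostly deduplication on a stable event ID before applying any side effect. A minimal sketch, with illustrative field names and an in-memory seen-set standing in for a persistent store:

```python
class Consumer:
    def __init__(self):
        self.seen = set()    # in production: persistent store, likely with a TTL
        self.total = 0

    def handle(self, event):
        """Apply the event's effect at most once per event_id."""
        if event["event_id"] in self.seen:
            return           # duplicate delivery: safely ignored
        self.seen.add(event["event_id"])
        self.total += event["amount"]


c = Consumer()
evt = {"event_id": "e-1", "amount": 5}
c.handle(evt)
c.handle(evt)                # broker redelivery is a no-op
assert c.total == 5
```

With consumers built this way, at-least-once delivery from the broker is sufficient, and the exactly-once problem disappears from the transport layer.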

Debugging Got Harder

When data flows through batch pipelines, you can trace the lineage fairly easily. With event-driven flows, a single business event might trigger a cascade of downstream events. We had to invest in distributed tracing and correlation IDs to maintain observability.
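The correlation-ID part is simple but load-bearing: every event derived from a business event inherits the root's correlation ID, so logs and traces across services can be joined on a single value. A sketch, with `uuid4` standing in for whatever ID scheme the real system uses:

```python
import uuid

def new_event(payload, parent=None):
    """Create an event, propagating the parent's correlation_id if present."""
    return {
        "event_id": str(uuid.uuid4()),       # unique per event
        "correlation_id": (
            parent["correlation_id"] if parent else str(uuid.uuid4())
        ),                                   # shared across the whole cascade
        "payload": payload,
    }


root = new_event({"order": 42})
child = new_event({"invoice": 7}, parent=root)
grandchild = new_event({"email": "sent"}, parent=child)
assert grandchild["correlation_id"] == root["correlation_id"]
```

Every hop in a cascade then shares one ID, which is what makes distributed tracing of a fan-out tractable.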

Not Everything Should Be Event-Driven

We learned that some workloads are genuinely better as batch. Complex analytical aggregations that join dozens of datasets don't benefit from event-driven processing — they need the full dataset. We stopped trying to force everything into the event model.

Key Takeaways

  1. Migrate incrementally. The dual-write pattern is ugly but practical for bridging batch and event-driven worlds during transition.
  2. Invest in observability early. Distributed tracing, event lineage, and dead-letter queue monitoring are not optional.
  3. Design idempotent consumers. Assume every event might be delivered more than once.
  4. Keep batch where it makes sense. Event-driven is a tool, not a religion.

Outcome

The migration took about eight months for the core platform. Latency for key data products dropped from hourly to sub-minute. More importantly, the team's velocity for building new data products roughly doubled because they no longer needed to coordinate pipeline changes for every new consumer.