Why We Made the Move
Our data platform started as a collection of batch jobs. Every hour, cron-triggered scripts would pull data, transform it, and load results. It worked well enough at small scale, but as the business grew, the cracks showed:
- Latency expectations changed. Stakeholders wanted fresher data — minutes, not hours.
- Coupling was everywhere. Because everything was tightly integrated, upstream changes regularly broke downstream consumers.
- Scaling was painful. Adding new data consumers meant modifying existing pipelines.
Event-driven architecture promised to solve these problems. Reality was more nuanced.
The Migration Approach
We didn't do a big-bang migration. Instead, we introduced event-driven patterns alongside existing batch processes, migrating workloads incrementally.
Phase 1: Event Bus Introduction
- Deployed an event broker as the central nervous system
- Existing batch jobs published events as a side effect (dual-write pattern initially)
- New consumers subscribed to events instead of polling databases
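The dual-write step above can be sketched as follows. This is a toy, in-memory version: the `publish` function, topic name, and row shapes are illustrative stand-ins, not our broker's actual API.

```python
# Minimal in-memory stand-in for the event broker.
EVENT_BUS = []

def publish(topic, payload):
    """Dual-write side effect: emit an event alongside the batch write."""
    EVENT_BUS.append({"topic": topic, "payload": payload})

def batch_load(rows, warehouse):
    """Existing batch job, instrumented to publish one event per loaded row."""
    for row in rows:
        warehouse.append(row)              # original batch write
        publish("orders.updated", row)     # new: event as a side effect

warehouse = []
batch_load([{"order_id": 1, "total": 42.0}], warehouse)
```

The risk, of course, is the two writes diverging on partial failure, which is why we treated dual-write as a bridge rather than a destination.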
Phase 2: Producer Migration
- Rewrote critical data producers to be event-native
- Eliminated the dual-write pattern where possible
- Added schema registry for event contracts
Phase 3: Consumer Optimization
- Built stream processing for real-time aggregations
- Maintained batch fallbacks for complex analytical workloads
- Implemented event sourcing for audit-critical data
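The event-sourcing item above boils down to one idea: state is never stored directly, only derived by replaying an append-only log. A minimal sketch, with made-up event types for an account balance:

```python
def apply(state, event):
    """Fold one event into the current state; event types are illustrative."""
    if event["type"] == "AccountOpened":
        return {"balance": 0}
    if event["type"] == "FundsDeposited":
        return {"balance": state["balance"] + event["amount"]}
    return state  # unknown events are ignored

def replay(events):
    """Event sourcing: current state is derived by replaying the full log."""
    state = {}
    for event in events:
        state = apply(state, event)
    return state

log = [
    {"type": "AccountOpened"},
    {"type": "FundsDeposited", "amount": 100},
    {"type": "FundsDeposited", "amount": 50},
]
# replay(log) -> {"balance": 150}
```

Because the log is the source of truth, the audit trail comes for free: every historical state is reconstructible.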
What Worked
Decoupling Was Real and Valuable
The biggest win was organizational, not just technical. Teams could build new features and data products without coordinating with upstream producers. A new dashboard? Subscribe to the relevant event streams. A new ML feature? Same thing. No tickets, no meetings, no pipeline modifications.
Schema Registry Was Essential
Enforcing schemas on events prevented the "garbage in" problem. When a producer tried to publish a breaking change, the registry rejected it. This forced explicit versioning conversations instead of silent breakage.
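The registry's rejection logic amounts to a compatibility check between schema versions. Here is a deliberately simplified sketch of one such rule (real registries support several compatibility modes; the schema shape below is invented for illustration):

```python
def is_backward_compatible(old_schema, new_schema):
    """Toy backward-compatibility rule: a new version may add fields,
    but must keep every field the old version marked as required."""
    old_required = set(old_schema["required"])
    new_fields = set(new_schema["fields"])
    return old_required <= new_fields

v1 = {"fields": {"order_id", "total"}, "required": ["order_id", "total"]}
v2_ok = {"fields": {"order_id", "total", "currency"}, "required": ["order_id", "total"]}
v2_bad = {"fields": {"order_id"}, "required": ["order_id"]}  # drops "total"
```

With a check like this in the publish path, a producer that drops `total` gets an immediate rejection instead of silently breaking its consumers.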
What Didn't Work (At First)
Exactly-Once Semantics Are Hard
We underestimated the complexity of exactly-once processing. The practical answer was idempotent consumers: design every consumer to handle duplicate events gracefully rather than trying to guarantee single delivery.
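The simplest form of an idempotent consumer is deduplication on a stable event id, sketched here with an in-memory set (in practice the seen-id store would be durable; the field names are illustrative):

```python
processed_ids = set()
results = []

def handle(event):
    """Idempotent consumer: a redelivery of the same event id is a no-op."""
    if event["event_id"] in processed_ids:
        return  # already handled; safe under at-least-once delivery
    processed_ids.add(event["event_id"])
    results.append(event["payload"])

handle({"event_id": "e1", "payload": "a"})
handle({"event_id": "e1", "payload": "a"})  # duplicate delivery, ignored
```

Under at-least-once delivery this turns "maybe processed twice" into "processed exactly once, observed from the consumer's side".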
Debugging Got Harder
When data flows through batch pipelines, you can trace the lineage fairly easily. With event-driven flows, a single business event might trigger a cascade of downstream events. We had to invest in distributed tracing and correlation IDs to maintain observability.
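The correlation-id approach mentioned above can be sketched as stamping every derived event with the id of the business event that started the cascade (the field names here are one common convention, not a standard):

```python
import uuid

def new_event(event_type, payload, parent=None):
    """Attach a correlation id so an entire cascade traces back to the
    originating business event; field names are illustrative."""
    return {
        "type": event_type,
        "payload": payload,
        "event_id": str(uuid.uuid4()),
        # Root events mint a correlation id; derived events inherit it.
        "correlation_id": parent["correlation_id"] if parent else str(uuid.uuid4()),
        # causation_id points at the immediate parent for step-by-step lineage.
        "causation_id": parent["event_id"] if parent else None,
    }

root = new_event("OrderPlaced", {"order_id": 1})
child = new_event("InvoiceCreated", {"order_id": 1}, parent=root)
```

Querying logs by `correlation_id` then recovers the whole cascade, which is roughly what batch lineage gave us for free.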
Not Everything Should Be Event-Driven
We learned that some workloads are genuinely better as batch. Complex analytical aggregations that join dozens of datasets don't benefit from event-driven processing — they need the full dataset. We stopped trying to force everything into the event model.
Key Takeaways
- Migrate incrementally. The dual-write pattern is ugly but practical for bridging batch and event-driven worlds during transition.
- Invest in observability early. Distributed tracing, event lineage, and dead-letter queue monitoring are not optional.
- Design idempotent consumers. Assume every event might be delivered more than once.
- Keep batch where it makes sense. Event-driven is a tool, not a religion.
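The dead-letter queue monitoring mentioned in the takeaways rests on a simple consumer-side pattern: retry a failing handler a bounded number of times, then park the event for inspection instead of blocking the stream. A minimal sketch (the retry count and structure are illustrative):

```python
dead_letter_queue = []

def consume(event, handler, max_retries=3):
    """Retry the handler, then park the event on a dead-letter queue
    rather than blocking the rest of the stream."""
    for attempt in range(max_retries):
        try:
            return handler(event)
        except Exception as exc:
            last_error = str(exc)
    dead_letter_queue.append({"event": event, "error": last_error})

def flaky(event):
    raise ValueError("bad payload")

consume({"event_id": "e1"}, flaky)
```

Alerting on dead-letter queue depth is then a cheap, high-signal health check for the whole event flow.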
Outcome
The migration took about eight months for the core platform. Latency for key data products dropped from hourly to sub-minute. More importantly, the team's velocity for building new data products roughly doubled because they no longer needed to coordinate pipeline changes for every new consumer.