There is no one-size-fits-all answer to this. The choice usually depends on the nature of the data and on your scalability and latency requirements. Both patterns have their merits, and both come with trade-offs worth weighing.
If you find that batches become cumbersome as data scales (for example, the sharding and parallelism logic grows complex), operating on one atomic entity at a time often scales better. The core idea is to break processing into small, independent units that can run across distributed systems without the overhead of managing batch boundaries.
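To make the contrast concrete, here is a minimal, framework-free sketch. The names (`load_records`, `process_record`) are hypothetical placeholders; the point is only that a per-record unit of work carries no batch-boundary bookkeeping:

```python
def load_records():
    """Stand-in for any record source (file, queue, socket)."""
    yield from ({"id": i, "value": i * 10} for i in range(5))

def process_record(record):
    """Each record is self-contained, so this can run on any worker."""
    return {**record, "value": record["value"] + 1}

# Batch style: boundaries must be chosen, filled, and managed explicitly.
batch = list(load_records())
batch_results = [process_record(r) for r in batch]

# Record-at-a-time style: no boundary management; each unit can be
# dispatched to any worker independently.
for record in load_records():
    result = process_record(record)
```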
This approach aligns well with modern data engineering practice. Stream processing models offer low latency and are easy to distribute, which simplifies scaling across multiple machines (i.e., horizontally). By operating on individual records, the pipeline stays robust as data volumes grow and remains straightforward to manage, especially on distributed platforms (e.g., Apache Kafka or Spark). Stream-based processing is naturally compatible with horizontal scaling, and it supports graceful fault tolerance and error recovery through checkpointing and state management.
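As a rough sketch of what that checkpointing looks like in practice, here is record-at-a-time consumption with the `kafka-python` client. The topic name `events`, the broker address, and the group id are assumptions for illustration; the key idea is that committing the offset only after a record is processed acts as the checkpoint:

```python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                           # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="pipeline-stage-1",
    enable_auto_commit=False,           # commit manually, as a checkpoint
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # ... process one record here; if this fails, the offset is never
    # committed, so the record is re-delivered after a restart ...
    consumer.commit()                   # checkpoint: this record is done
```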
Two relevant patterns spring to mind: the microservices pattern, and the Lambda/Kappa architectures. In a microservices pattern, splitting the pipeline into independent, loosely coupled services increases flexibility and scalability. Each service handles a single stage of the data pipeline, processing one entity at a time, which allows scalable parallelism and simplifies orchestration across distributed systems (see the sketch below).
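A hedged sketch of one such stage as an independent service: it consumes from an upstream topic, applies its single transformation, and publishes downstream. The topic names (`raw-events`, `enriched-events`) and the `enrich()` logic are hypothetical:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

def enrich(record):
    """This stage's single responsibility (placeholder logic)."""
    return {**record, "enriched": True}

consumer = KafkaConsumer(
    "raw-events",
    bootstrap_servers="localhost:9092",
    group_id="enrichment-service",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Scaling out is just running more copies of this process: Kafka's
# consumer groups spread partitions across the instances automatically.
for message in consumer:
    producer.send("enriched-events", enrich(message.value))
```

The design choice here is that each service knows only its input and output topics, so stages can be deployed, scaled, and replaced independently.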
If the use case involves both real-time and historical data processing, consider adopting a Lambda Architecture. For real-time-only requirements, the Kappa Architecture simplifies things by processing all data as streams: historical data is handled by replaying the stream through the same code path rather than by a separate batch layer. Both patterns let you maintain low latency while processing large volumes of data.
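A minimal sketch of that Kappa-style replay with `kafka-python`: since the log is the system of record, "reprocessing history" is just seeking a consumer back to the beginning of the topic. Partition 0 and the topic name are assumptions; a real job would replay every partition:

```python
import json
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
partition = TopicPartition("events", 0)  # hypothetical topic/partition
consumer.assign([partition])
consumer.seek_to_beginning(partition)    # replay instead of a separate batch job

for message in consumer:
    record = message.value
    # ... the same processing logic serves both live and historical data ...
```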
In short, a one-at-a-time processing approach offers better scalability and flexibility, particularly for large and growing datasets. By focusing on stream processing and leaning on design patterns like microservices or the Kappa Architecture, you can scale the pipeline across distributed systems while keeping it efficient, and you avoid much of the operational complexity that batch processing brings as pipelines scale out horizontally.