Big Data Pipelines using Open Source Technologies and Public Cloud

Data pipelines are crucial components of any big data applications. These are software that handles data streaming and batch processing, whereby data undergoes various transformations along the way.

This blog describes various big data streaming/batch processing options available with private clusters leveraging open source technologies and with serverless public cloud infrastructures like AWS.

Option 1: Serverless Architecture with Kinesis, Lambda & DynamoDB

The biggest advantage of the serverless architecture is that the services run on a managed infrastructure from the provider (for this instance, let’s take AWS).

AWS managed services

The above diagram focuses on AWS managed services listed below:

API Gateway – Provides end-point for client on-premise installations to push data.

Kinesis – A streaming solution that is capable of storing data in shards for a certain retention period. Each shard is capable of storing at the rate of 1000 records per second, with a maximum size of each record limited to 1 MB.

Lambda – Listeners that are capable of being triggered when data arrives in Kinesis. Kinesis will buffer the number of records until a batch size is reached or a window expires. For scalability, the number of shards needs to be increased and thereby the number of Lambda processors as well.

DynamoDB – The NoSQL database for receiving the raw data. It is a highly scalable storage option, where data is stored in shards based on the read and write capacity. One consideration here is the maximum limit of read (40000 for certain regions) and write capacity (40000 for certain regions) which, of course, can be increased through request. A max of 40000 indicates 40000 * 4KB * 2 = 320 MB/sec per table. In reality, a read capacity of 4000 is common. A single table on NoSQL would be designed to host multiple entities due to the schema-less nature. The partition key chosen should be in such a way that they are dense so that the data would be uniformly distributed and there won’t be any “Hot” partitions as a result of a sparse key.

Option 2: Open Source Data Pipeline with Kafka

This architecture involves a Kafka-only solution where the Kafka REST proxy is used as the REST end-point to which client deployments would send data. This data would get stored in the Kafka partitions. Kafka-connect could be configured to siphon the data to a distributed file storage like HDFS or an RDBMS like PostgreSQL or a NoSQL database like MongoDB in a batch process. As for stream processing, Kafka consumer programs could process the data performing aggregations and other calculations generating new streams or updating external databases.

Open Source Data Pipeline with Kafka

Open Source Data Pipeline with Kafka 1

A Kafka cluster serving the data pipeline could be easily set up inside a Kubernetes cluster. Kafka consumers would consume the ingested data and produce new topics across multiple partitions. Data from these partitions could be visualized through a web server.

Read More on Cloud Computing

Option 3: Open Source Data Pipeline with Kafka and Spark

Open Source Data Pipeline with Kafka and Spark

In this solution, we use Kafka as the streaming engine that ingests the data from an external source and Spark as the batch processor, storing the data as a distributed dataset in HDFS or NoSQL databases like MongoDB / Redis. The Spark job would start reading data from a particular begin offset in Kafka, up to an end offset remembering both the offsets so as to repeat the process for subsequent iterations.

A separate pipeline could be used for stream processing of the Kafka data to perform aggregations and push the resulting data into an RDBMS or NoSQL database.

Option 4: Serverless Architecture with Kinesis and AMR

Serverless Architecture with Kinesis and AMR

In this architecture, we pump data into Kinesis and have EMR Spark-based consumer programs using Kinesis Client Library to pick up data from Kinesis. The KCL uses a unique Amazon DynamoDB table to keep track of the application’s state called “checkpointing” so that it can recover in case if a processor crash happens. A newly started processor would look for the last read UUID and start off reading messages so that no message is lost. Of course, we need to set the read and write capacity of the DynamoDB such that checkpointing does not suffer. Overall, we should do necessary adjustments so that there are no bottlenecks either at the sender, the Kinesis, the processor, or the storage end.


Irrespective of the data pipeline we choose, it’s necessary that the design should be capable of handling bottlenecks at from all fronts of the architecture.

At the sender side, the Kinesis Producer Library (KPL) could be used to efficiently put records into Kinesis. A Kinesis agent that wraps the KPL is also available that supports retries, aggregations, failover, and such.

At the streamer side, Kinesis Data Stream provides real-time streaming capability. Of course, the number of shards is a key configuration attribute which would depend on the throughput. We can use Cloud watch metrics to monitor throughput exceptions and then decide if we need to increase the number of shards. We can also program the system in such a way to increase and decrease the number of shards based on CloudWatch metrics.

At the processing side, we could use the KCL with checkpointing capabilities so that in case of a processor crash, it automatically spawns another process that would start off from where the previous processor discontinued.

Read More: Exposure in Image and Video Processing

Grow your business & increase sales_Grab your free consultation Now!

Author: InApp
We are a custom software development company offering Testing Services, Application Development, Mobility Solutions & more. Customers: Startups - Fortune 500

1 Comment

  • Anonymous

    Such a very useful article. Very interesting to read this article. I would like to thank you for the efforts you had made for writing this awesome article

Leave a Reply

1 × five =