Consuming data from AWS Kinesis
July 19, 2020
There are several ways to consume data from Kinesis Data Streams.
- AWS SDK (or the AWS CLI)
- Kinesis Client Library (KCL)
- Kinesis Connector Library
- AWS Lambda
You can also use some non-AWS tools like Apache Spark.
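When using the SDK, you pull records from a shard with the GetRecords API. Each shard supports up to 2 MB per second of read throughput, and that limit is shared by every consumer polling the shard.
A single GetRecords call returns up to 10 MB of data or up to 10,000 records. If a call returns the full 10 MB, further calls on that shard within the next 5 seconds are throttled. You also need to throttle your consumer so that it does not exceed 5 GetRecords calls per shard per second, which works out to one call every 200 ms. Because these limits are per shard and not per consumer, adding more consumers that poll the same shard leaves each one with a smaller share of the calls and throughput, and therefore higher latency.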
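The polling loop described above might be sketched as follows. This is only an illustration, assuming `client` behaves like a boto3 Kinesis client (`get_shard_iterator` and `get_records`); the function name `poll_shard`, the `handle_record` callback, and the `max_polls`, `sleep`, and `clock` parameters are made up for this example. Retries and error handling are left out.

```python
import time

# Shared limit from the text: 5 GetRecords calls per shard per second.
MIN_POLL_INTERVAL = 0.2  # seconds between calls on one shard


def poll_shard(client, stream_name, shard_id, handle_record, max_polls=None,
               sleep=time.sleep, clock=time.monotonic):
    """Read one shard with GetRecords, pacing calls to at most 5 per second.

    `client` is assumed to behave like a boto3 Kinesis client.
    Returns the number of records passed to `handle_record`.
    """
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
    )["ShardIterator"]

    handled = 0
    polls = 0
    while iterator is not None and (max_polls is None or polls < max_polls):
        started = clock()
        response = client.get_records(ShardIterator=iterator, Limit=1000)
        polls += 1
        for record in response["Records"]:
            handle_record(record)
            handled += 1
        # NextShardIterator is missing or None once a closed shard is fully read.
        iterator = response.get("NextShardIterator")
        # Sleep off whatever is left of the 200 ms budget for this call.
        remaining = MIN_POLL_INTERVAL - (clock() - started)
        if remaining > 0:
            sleep(remaining)
    return handled
```

With boto3 installed, a call could look like `poll_shard(boto3.client("kinesis"), "my-stream", "shardId-000000000000", print)`, where the stream name is a placeholder. If several consumers poll the same shard, the interval would have to grow accordingly, since the 5 calls per second are shared.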
The KCL is a Java library that can also be used from other languages (Python, Ruby, Node.js, .NET) through its MultiLangDaemon; Go is covered by community ports. The KCL has a checkpointing feature that lets a consumer stop and later resume reading from where it left off. It does this by storing its progress in a DynamoDB table. This means that if you don’t provision enough WCU/RCU for that table, DynamoDB can become the bottleneck even if your stream has enough throughput.
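To make the checkpointing idea concrete, here is a small sketch. It is not the KCL API: `InMemoryCheckpointStore` is a hypothetical stand-in for the DynamoDB table, and `process_batch` and `checkpoint_every` are names invented for this example. It only shows how saved sequence numbers let a consumer skip work it has already done, and why checkpoint frequency translates into writes.

```python
class InMemoryCheckpointStore:
    """Hypothetical stand-in for the DynamoDB table; one entry per shard."""

    def __init__(self):
        self._table = {}
        self.writes = 0  # each write would consume DynamoDB write capacity

    def save(self, shard_id, sequence_number):
        self._table[shard_id] = sequence_number
        self.writes += 1

    def load(self, shard_id):
        return self._table.get(shard_id)


def process_batch(store, shard_id, records, handle_record, checkpoint_every=100):
    """Process records in order, checkpointing every `checkpoint_every` records.

    Records at or before the stored checkpoint are skipped, which is how a
    restarted consumer resumes where it left off.
    """
    last_done = store.load(shard_id)
    since_checkpoint = 0
    for record in records:
        # Kinesis sequence numbers are numeric strings; compare them as ints.
        seq = int(record["SequenceNumber"])
        if last_done is not None and seq <= int(last_done):
            continue  # already processed before the restart
        handle_record(record)
        last_done = record["SequenceNumber"]
        since_checkpoint += 1
        if since_checkpoint >= checkpoint_every:
            store.save(shard_id, last_done)
            since_checkpoint = 0
    if since_checkpoint:
        store.save(shard_id, last_done)  # flush the final partial batch
    return last_done
```

A larger `checkpoint_every` means fewer writes (less WCU) but more records reprocessed after a crash; a smaller one does the opposite.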
The Kinesis Connector Library is older. It runs on an EC2 instance, reads data from streams, and writes it to other AWS services. It is rarely used now, given the other options available.
Lambda is a good way to read from a stream if you want a serverless option: Lambda polls the stream for you and invokes your function with batches of records.
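A minimal handler for a Kinesis-triggered function might look like the sketch below. The event layout (`Records`, `kinesis.data` as base64, `partitionKey`, `sequenceNumber`) is the one Lambda uses for Kinesis events; the assumption that payloads are JSON, and the shape of the return value, are choices made for this example only.

```python
import base64
import json


def handler(event, context):
    """Lambda entry point for a function triggered by a Kinesis stream."""
    decoded = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        # Lambda delivers the record payload base64-encoded.
        payload = base64.b64decode(kinesis["data"])
        decoded.append({
            "partition_key": kinesis["partitionKey"],
            "sequence_number": kinesis["sequenceNumber"],
            "body": json.loads(payload),  # assumption: producers send JSON
        })
    # The return value is mainly useful for local testing of this sketch.
    return {"processed": len(decoded), "items": decoded}
```

Because the handler is a plain function, it can be exercised locally by passing a hand-built event dictionary and `None` for the context.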