Kinesis Firehose: An Overview

Kinesis Firehose is specifically designed to load streaming data into S3, Redshift or Elasticsearch and no other destinations. Some key notes on Firehose are:

  • You can change the destination (between S3, Redshift or Elasticsearch) without downtime
  • Firehose is fully managed, automatically scaled and requires no programming to configure
  • Firehose handles all sharding and management of your stream
  • You can transform data using Lambda (e.g. encryption or changing formatting)
  • There is roughly a one-minute data latency: it takes about a minute for data to appear at the destination (e.g. S3) once it hits the stream, so Firehose is not as fast as Kinesis Streams
  • Guarantees at least once delivery
  • Does not allow for parallel consumption (cannot dump data to S3 + Elasticsearch, only one per Firehose stream)
  • Integrates with CloudWatch (records in/out, size, count, latency) and sends metrics every minute for near real-time monitoring
  • Transformation KPIs are also included in CloudWatch, covering the successes and failures of data transformations

Get data into Firehose using the Kinesis Agent, the Kinesis Producer Library (KPL) or the AWS SDK.
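The SDK route can be sketched with boto3's `put_record` call. The stream name and event shape below are hypothetical, and the client is passed in so the record-building logic stays testable without AWS access:

```python
import json

def build_record(event: dict) -> dict:
    # Firehose expects a record with a 'Data' blob; a trailing newline
    # keeps records separable once Firehose concatenates them in S3.
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}

def send_event(firehose_client, stream_name: str, event: dict) -> dict:
    # firehose_client is a boto3 "firehose" client, e.g.
    #   boto3.client("firehose", region_name="us-east-1")
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record=build_record(event),
    )
```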

Load data from Firehose into S3

  • When loading data into S3, we are able to set a buffer size (the amount of data that Firehose will accumulate before dumping into an S3 bucket – anywhere between 1MB and 128MB)
  • The buffer size is applied before compressing data. E.g. if the buffer = 128MB, then the data actually loaded into S3 may be smaller as the 128MB is compressed
  • You can also set a buffer interval – anywhere between 60 seconds and 900 seconds.
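As a sketch, these two knobs correspond to the `BufferingHints` block passed to the `create_delivery_stream` API for an S3 destination. The values below are illustrative; the ranges encode the S3-destination limits (1–128MB, 60–900 seconds):

```python
# Firehose flushes to S3 when EITHER threshold is reached first.
S3_BUFFERING_HINTS = {
    "SizeInMBs": 128,          # 1-128 MB, applied before compression
    "IntervalInSeconds": 300,  # 60-900 seconds
}

def buffering_hints_valid(hints: dict) -> bool:
    # Reject values outside the S3-destination limits.
    return (1 <= hints["SizeInMBs"] <= 128
            and 60 <= hints["IntervalInSeconds"] <= 900)
```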

Load data from Firehose into Redshift

  • Firehose will automate the process of saving transformed data into S3 & then copying that data into Redshift (this is constrained by the rate at which Redshift is able to load your data)
  • The compression used must be GZIP (SNAPPY and ZIP are not supported for Redshift loads)
  • You can have an S3 bucket in region1 and copy to Redshift, even if the Redshift instance is in another region
  • You can keep a copy of original and transformed data in S3
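A rough sketch of how these pieces fit into a Redshift destination configuration for `create_delivery_stream`. The ARNs, cluster URL, bucket and table name are all placeholders, and field names should be checked against the current API reference:

```python
# Illustrative shape only -- every identifier below is a placeholder.
REDSHIFT_DESTINATION = {
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
    "ClusterJDBCURL": "jdbc:redshift://example-cluster:5439/mydb",
    "CopyCommand": {
        "DataTableName": "events",
        "CopyOptions": "gzip",  # Firehose-to-Redshift loads must use GZIP
    },
    # Firehose stages data in S3 first, then issues a COPY into Redshift.
    "S3Configuration": {
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "BucketARN": "arn:aws:s3:::my-staging-bucket",
    },
}
```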

Load data from Firehose to Elasticsearch

  • You can utilise a Lambda function to dump data directly into Elasticsearch from the stream
  • You can keep a copy of the original and transformed data in S3

Lambda and Firehose:

We can use Lambda to validate data, translate fields or change data format. There are Lambda blueprints for transforming data (e.g. changing data format).
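A minimal sketch of such a transformation function, following the record contract Firehose uses for Lambda transforms: base64-encoded data in, and each record returned with the same `recordId` plus a `result` of "Ok", "Dropped" or "ProcessingFailed". The field added here is purely illustrative:

```python
import base64
import json

def handler(event, context):
    # Firehose invokes the function with a batch of base64-encoded records.
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # illustrative transformation
        output.append({
            "recordId": record["recordId"],  # must echo the incoming id
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```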

Common Firehose issues:

  • Duplicate records
  • Delivery failures (a failed folder is created in S3 for error logs)

Firehose Limits:

  • 50 streams per region (soft limit)
  • 24 hour data retention only
  • Max payload of 1MB
  • Buffer size for S3 is between 1MB and 128MB
  • Buffer size for Elasticsearch is between 1MB and 100MB
  • One stream to one S3 bucket
  • 5MB per second (soft limit)
  • 5000 records/second (soft limit)
  • 2000 transactions / second (soft limit)
  • 4MB or 500 records limit on PutRecordBatch API (batching records together to reduce load).
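Staying under the 4MB / 500-record PutRecordBatch cap can be sketched with a simple batching helper. The limits are parameters so they can be tuned down for headroom:

```python
def chunk_records(records, max_records=500, max_bytes=4 * 1024 * 1024):
    # Yield batches of Firehose records that respect PutRecordBatch's
    # per-call limits, counting only the 'Data' payload bytes.
    batch, size = [], 0
    for rec in records:
        rec_size = len(rec["Data"])
        if batch and (len(batch) >= max_records or size + rec_size > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(rec)
        size += rec_size
    if batch:
        yield batch
```

Each yielded batch can then be passed as the `Records` argument of a single PutRecordBatch call.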

Firehose Pricing:

  • Price for the cumulative amount of data ingested (per GB)
  • Each transaction payload is rounded up to the nearest 5KB (so 7KB would be rounded to 10KB); best practice is to batch records to reduce cost.
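The 5KB rounding can be made concrete with a small helper – a sketch of the billing arithmetic, not an official pricing calculator:

```python
import math

def billed_kb(payload_kb: float) -> int:
    # Each record payload is rounded UP to the next 5 KB increment,
    # so many tiny records cost far more per byte than batched ones.
    return math.ceil(payload_kb / 5) * 5
```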