Big data series: Data streams and data lakes

A data stream is a continuous flow of data that is potentially unbounded in size. Let's use an example: imagine you're a data scientist at Twitter and you have 6,000 tweets per second flowing into your big data cluster. You need to run some initial aggregations and calculations on the data to derive insights before handing it over to the business for further (non-real-time) analysis.

One of the big challenges with streaming data is that it's unpredictable. Twitter averages around 6,000 tweets per second. However, in 2013, it saw 143,000 tweets in a single second (source). So its data scientists needed to be able to handle that massive influx.

The way we can handle this is to (a minimal sketch follows the list):

  • Keep streaming data calculations / operations as simple as possible
  • Carry out operations on a single unit of data at a time
  • Utilize a scalable environment*
  • Ensure the streaming service is a ‘subscriber’ to the data source; it does not need to feed back to or otherwise interact with the source
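
To make the first two points concrete, here is a minimal Python sketch that treats each incoming tweet as a single unit of work and keeps the calculation deliberately simple (a running hashtag count). The tweet structure and field names are invented for illustration, not Twitter's actual schema.

```python
# Minimal sketch: process one record at a time, no batching, no lookback.
from collections import Counter

hashtag_counts = Counter()

def handle_tweet(tweet: dict) -> None:
    """Operate on a single unit of data; keep the per-record work cheap."""
    for tag in tweet.get("hashtags", []):
        hashtag_counts[tag] += 1

# The subscriber receives tweets one by one from the stream.
incoming = [
    {"text": "loving #bigdata", "hashtags": ["bigdata"]},
    {"text": "#bigdata and #streaming", "hashtags": ["bigdata", "streaming"]},
]
for tweet in incoming:
    handle_tweet(tweet)

print(hashtag_counts.most_common(3))
```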

Additional complexity is built into the streaming data equation when we start thinking about dynamic steering. That is, we determine the next steps of the application based on real-time data analysis. Good examples are self-driving cars and online gaming.
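
As a hedged illustration of dynamic steering (the sensor fields and thresholds below are invented), the key point is that each new data point can immediately change what the application does next:

```python
# Sketch: every reading is analysed as it arrives and directly drives the next action.
def steer(reading: dict) -> str:
    """Decide the application's next step from the latest data point alone."""
    if reading["obstacle_distance_m"] < 5.0:
        return "brake"
    if abs(reading["lane_offset_m"]) > 0.5:
        return "correct_steering"
    return "maintain_course"

for reading in [{"obstacle_distance_m": 12.0, "lane_offset_m": 0.7},
                {"obstacle_distance_m": 3.2, "lane_offset_m": 0.1}]:
    print(steer(reading))
```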

Many organizations implement what we refer to as the Lambda architecture: the concept of utilizing both streaming (data-in-motion) and batch (data-at-rest) jobs to extract insight from data. The idea is that we carry out quick real-time analysis first, then batch the data up and gather further insights from more detailed, heavier analysis.
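
A rough sketch of that idea, with an invented event format: every event is both folded into a cheap real-time aggregate (the streaming path) and appended to raw storage for a later, heavier batch job (the batch path).

```python
# Sketch of the Lambda idea: one ingest point feeding two paths.
from collections import Counter

speed_layer_counts = Counter()   # quick, real-time view (data-in-motion)
raw_store = []                   # accumulated raw data (data-at-rest)

def ingest(event: dict) -> None:
    speed_layer_counts[event["user"]] += 1   # streaming path: instant insight
    raw_store.append(event)                  # batch path: keep for later

def batch_job(events: list) -> dict:
    """Heavier, non-real-time analysis over the accumulated raw data."""
    per_user = Counter(e["user"] for e in events)
    return {"total_events": len(events), "top_users": per_user.most_common(5)}

for e in [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]:
    ingest(e)

print(speed_layer_counts)        # real-time insight
print(batch_job(raw_store))      # later, more detailed insight
```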

In Lambda, we use a stream storage layer to read streams quickly and to ensure data order and consistency. We also use a processing layer, which retrieves data from storage and loads it into a batch job.
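
Here is a minimal sketch of those two layers, with invented class and function names: the storage layer is an ordered, append-only log keyed by sequence number, and the processing layer reads a range of records from it into a batch job.

```python
# Sketch: an ordered stream storage layer plus a processing layer that reads from it.
class StreamStorage:
    """Keeps records in arrival order so consumers see a consistent sequence."""
    def __init__(self):
        self.log = []

    def append(self, record: dict) -> int:
        self.log.append(record)
        return len(self.log) - 1          # sequence number of the new record

    def read(self, start: int, end: int) -> list:
        return self.log[start:end]        # ordered slice handed to the batch job

def processing_layer(storage: StreamStorage, start: int, end: int) -> int:
    batch = storage.read(start, end)      # retrieve data from the storage layer
    return len(batch)                     # stand-in for a real batch job

storage = StreamStorage()
for i in range(10):
    storage.append({"tweet_id": i})
print(processing_layer(storage, 0, 10))
```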

Often, an organization will load all streams into a data lake for analysis. A data lake stores data in its raw form (i.e. in its native format, before any ETL). From the data lake, we can carry out our large batch jobs. When the application reads the data, it applies a structure to that data; we call this ‘schema on read’. Each object in a data lake is stored as a binary large object (BLOB).
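
A hedged sketch of ‘schema on read’ (paths and field names are assumptions for the example): the raw JSON blob sits in the lake untouched, and a structure is only applied at the moment the application reads it.

```python
# Sketch: store raw blobs as-is, apply a schema only when reading.
import json
import os
import tempfile

lake_dir = tempfile.mkdtemp()

# Write-time: keep the object in its native format, no ETL.
raw_tweet = b'{"id": 1, "text": "hello #bigdata", "lang": "en"}'
with open(os.path.join(lake_dir, "tweet_1.json"), "wb") as f:
    f.write(raw_tweet)

# Read-time: the reader decides which structure it needs.
def read_with_schema(path: str) -> dict:
    with open(path, "rb") as f:
        blob = f.read()                   # stored as an opaque blob
    record = json.loads(blob)
    return {"id": int(record["id"]), "text": str(record["text"])}

print(read_with_schema(os.path.join(lake_dir, "tweet_1.json")))
```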

The underlying challenge with streaming data, then, is that the stream can change rapidly and unpredictably in size, while the time and storage we have to deal with it are finite.

*Scalable environments include AWS (utilizing Kinesis) and open-source frameworks such as Apache Storm, Flink, Spark, Kafka and Samza.