There are a number of different parts to the design of a big data cluster. We can get to the bottom of many data management issues by asking a few questions about integration, storage, quality control, operations, and scalability and security. We will discuss each area in detail below.
When we’re looking at data ingestion into the platform, we need to ask a few questions:
- How many data sources do we currently have and do we anticipate a growth in the number of sources?
- How large is the data? How large is each record and how many records do we have?
- How quickly is our data growing?
- What are our policies / rules around ‘bad’ data?
- What are our error handling policies (e.g. retry once, then discard)?
- What are our policies / rules around changes in data rates (a significant increase or decrease)?
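To make the error handling question concrete, here is a minimal sketch of a "retry once, then discard" policy. The names `ingest_with_retry` and `example_load` are hypothetical, and the sketch assumes that a failed load is signalled by a `ValueError`; a real ingestion pipeline would have its own failure signals and would likely log or quarantine discarded records rather than just collecting them.

```python
from typing import Callable, Iterable, List, Tuple


def ingest_with_retry(
    records: Iterable[dict],
    load: Callable[[dict], None],
    max_retries: int = 1,
) -> Tuple[List[dict], List[dict]]:
    """Try to load each record; retry up to max_retries times, then discard."""
    loaded, discarded = [], []
    for record in records:
        for attempt in range(max_retries + 1):
            try:
                load(record)
                loaded.append(record)
                break
            except ValueError:
                # Assumption: load() raises ValueError when it cannot accept a record.
                if attempt == max_retries:
                    discarded.append(record)
    return loaded, discarded


def example_load(record: dict) -> None:
    """Hypothetical loader that rejects records without an 'id' field."""
    if "id" not in record:
        raise ValueError("missing id")


loaded, discarded = ingest_with_retry([{"id": 1}, {"name": "no id"}], example_load)
print(len(loaded), len(discarded))  # 1 1
```

Keeping the discarded records in a separate list makes it easy to apply whatever 'bad' data policy we have agreed on, for example counting them or writing them to a separate location for inspection.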
All of the above will provide us with a strategy around our data ingestion. We then need to think about how we’re going to store the data we ingest – this is a more complex problem than it may sound.
We have many storage options, ranging from L1 cache at one end to tape at the other. L1 cache is the most performant option, but scaling it to handle big data is unrealistic and extremely costly. At the other end of the scale, tapes are very cost-effective, but they lack performance and may cause bottlenecks in our big data cluster.
Part of your role when designing your big data cluster is to balance cost and performance. We can also implement archiving rules whereby data over a certain age moves to a slower type of storage.
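An archiving rule of this kind can be sketched as a simple lookup from data age to storage tier. The tier names and the 30-day and 365-day thresholds below are purely illustrative assumptions, not recommendations; the right values depend on your own cost and performance requirements.

```python
from datetime import date, timedelta

# Hypothetical tiers and age thresholds, ordered from fastest to slowest storage.
TIER_RULES = [
    (timedelta(days=30), "ssd"),
    (timedelta(days=365), "hdd"),
]
ARCHIVE_TIER = "tape"  # anything older than the last threshold ends up here


def choose_tier(created: date, today: date) -> str:
    """Return the storage tier for data created on `created`."""
    age = today - created
    for max_age, tier in TIER_RULES:
        if age <= max_age:
            return tier
    return ARCHIVE_TIER


today = date(2024, 1, 1)
print(choose_tier(date(2023, 12, 20), today))  # ssd
print(choose_tier(date(2023, 6, 1), today))    # hdd
print(choose_tier(date(2020, 1, 1), today))    # tape
```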
We also have NVMe, which has been defined as “a communications interface/protocol developed specially for SSDs by a consortium of vendors including Intel, Samsung, Sandisk, Dell, and Seagate”. Put simply, it enables fast data transfer between memory and SSDs. This should also feature as a consideration in your cluster strategy.
Operations carried out on your big data cluster must be as efficient as possible. Efficiency is measured by the time, space and resources that an operation consumes. We can improve efficiency by utilizing parallelism. Let’s look at an example.
Here, we have 6 data points (1, 2, 3, 4, 5, 6). If I wanted to run a script to pull out only the even numbers, I could run it on the entire dataset in one go. Alternatively, I could split the data into three subsets, (1, 2), (3, 4) and (5, 6), and run three operations in parallel, each extracting the even numbers from the subset it has been given.
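The split-and-filter idea above can be sketched in a few lines of Python. This is only an illustration on a single machine: the function names are hypothetical, and a thread pool stands in for the separate workers or nodes that a real cluster would use. For CPU-bound work in CPython, threads will not give a real speed-up, so treat this as a model of the pattern rather than a performance recipe.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def extract_evens(subset: List[int]) -> List[int]:
    """The operation each worker runs on its own subset."""
    return [n for n in subset if n % 2 == 0]


def parallel_evens(data: List[int], chunk_size: int = 2) -> List[int]:
    """Split data into chunks, filter each chunk in parallel, then merge."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=len(chunks) or 1) as pool:
        # pool.map returns results in the same order as the input chunks.
        results = pool.map(extract_evens, chunks)
        return [n for part in results for n in part]


print(parallel_evens([1, 2, 3, 4, 5, 6]))  # [2, 4, 6]
```

With a chunk size of 2, the six data points are split into (1, 2), (3, 4) and (5, 6), matching the three subsets described above.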
Finally, we need to think about scalability and security. When we think about scaling, we need to consider whether we want to scale vertically or horizontally. Vertical scaling is the process of adding more compute / memory to a machine. Horizontal scaling is the process of adding more machines to the cluster to share the workload. Horizontal scaling is much more popular and usually more cost-effective.
However, more machines do lead to more security risks, which makes security even more important. We need to secure data at rest where possible, and when transferring data across the open web we need to consider encrypting our traffic in transit. Remember, though, that encryption and decryption add compute overhead to your cluster and may therefore add delay to your data analysis and value extraction.