Our world continues to get smarter and more connected, and as it does, we see exponential growth in the amount of data we generate.
As an example, let’s say that you visited Facebook. While you were there you posted something about your day, which ten friends liked and a few commented on. All of this data is stored somewhere by Facebook. Now, you may think this is small data, which it is, but multiply it by the millions of active members that Facebook has and that’s where we start to see uncontrollable data growth. In fact, in 2012 Facebook were generating over 500 terabytes of data per day (source), a figure that will be far larger now.
What does Facebook do with this data? They analyse it to enhance the user experience and to cement their place as the top social networking site. As an example, they use all the data they hold to connect the dots between you and ‘people you might know’, which they do with an incredible degree of accuracy. Alright, there are probably a few people that appear in that list that are friends of friends that you’ve never met, but by and large, they’ve done a great job.
This data can also be used for competitive advantage. Not only do Facebook have a great deal of detail about you, including your name, location and date of birth, they also know the sort of things you’re interested in, based not only on the groups and pages that you like, but also on analysis of the content of your posts and comments, which tells them what your opinions are and what you’re interested in at this particular moment. All of this enables Facebook to target adverts very accurately. After all, if they know that I’m a 26-year-old male living near London who happens to be very interested in Formula One, they can target me with offers for the British Formula One race. Because the advert is so well targeted, I am far more likely to click through, improving Facebook’s PPC revenue.
Wow, they have a lot of data, I won’t have that much!
While I agree that it’s more than likely the case that you won’t have 500TB of data to contend with in your business, you may already be pulling data from multiple systems (Google Analytics, Excel, SQL) and you may be carrying out a lot of analysis.
There is no real definition for ‘big’ data. Your data doesn’t have to be big compared to the amount collected by Facebook, it just has to be big enough to make timely reporting of business insights a very real challenge.
So the question is: is your data so big that it’s getting difficult to process it all in time to answer those burning strategic questions? If so, read on for our views on strategies for deploying big data environments.
Remember, big data environments aren’t limited to just Hadoop, Teradata and other high-profile solutions. There are other solutions, less synonymous with big data but still very capable, such as MongoDB. It may even be the case that you don’t need a big data environment at all; you might just need a solid database environment and a visualization tool such as QlikView to speed up the analysis of the data you have.
Unless your data runs to multiple terabytes in size and is pulled from disparate data sources, Hadoop is complete overkill and will just serve as a bottomless pit of money to suck your organization dry. MySQL is a free solution that can handle several terabytes of data, depending on your hardware configuration, and a NoSQL alternative such as MongoDB can do the same.
The key here is to understand what you really want. MongoDB is great for real-time, operational applications, serving business processes and end users. Hadoop, on the other hand, is great at blending data from multiple sources to deliver sophisticated analytics and machine learning. Generally, Hadoop is not designed to replace the traditional MySQL backend databases behind your applications.
Many organizations use a blend of the two. Hadoop crunches the data and then loads the results into a MongoDB instance for fast consumption. You’ll need to nail down your requirements before choosing a big data solution.
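To make the blend described above concrete, here is a minimal sketch of the hand-off step: taking the tab-separated `key\tvalue` output of a Hadoop streaming job and shaping it into documents ready for a MongoDB bulk insert. The field names, collection name and sample figures are all illustrative assumptions, not taken from any real system.

```python
# Hypothetical pattern: reshape Hadoop streaming reducer output
# (one "key \t value" pair per line) into MongoDB-style documents
# for fast consumption. Field names here are purely illustrative.

def hadoop_output_to_docs(lines):
    """Turn 'key\tvalue' reducer output lines into document dicts."""
    docs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the job output
        key, value = line.split("\t", 1)
        docs.append({"region": key, "revenue": float(value)})
    return docs

# Stand-in for the output file of a (hypothetical) revenue-by-region job.
sample_output = [
    "London\t125000.50",
    "Manchester\t98000.00",
]
docs = hadoop_output_to_docs(sample_output)

# With pymongo you would then load the results for serving, e.g.:
#   from pymongo import MongoClient
#   MongoClient()["analytics"]["regional_revenue"].insert_many(docs)
print(docs)
```

The point of the sketch is the division of labour: Hadoop does the heavy aggregation offline, and the serving store only ever receives small, pre-digested result sets.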
Okay, I want to run a big data initiative and I want it now…
Alright fair enough, who am I to argue with you? But can you tell me exactly what you want to do with the data?
You’d be amazed how many people can’t answer that simple question. They know they’re generating lots of data and they want to analyse it, but they’re not sure why and they’re certainly not sure what benefit they’re going to get from that analysis. Before you embark on a big data initiative, it’s important to outline the business benefit of investing time and money in the project. This phase should also include taking a small sample of data and proving the value that can be derived from it, before rolling out to the full-scale dataset.
My advice would be to carry out the following:
- Identify the strategic questions that individuals within the business want answered.
- Using a subset of your data, test and document the algorithms which will be used to answer those questions, reverting to the requester to ensure that the output is meaningful to them.
- Once you have identified the data that’s going to be analysed, you’ll need to consider the security around that data. How sensitive is the data? Who should have access? Where is the data stored? What controls do you have in place?
- Once you have documented your full suite of data manipulations and required security provisions, you’ll need to understand the load that they will place on your systems. By doing this, you should be able to identify the hardware requirements for your big data environment.
- Next, you’ll need to think about how you want your users to access and query the data. There are plenty of off-the-shelf visualization tools (such as QlikView, Tableau, etc.), which are great if you want to give business users access to build their own reports and analyse the data themselves. If you don’t have that requirement, a static, bespoke development may be more cost effective for you.
- Once you have nailed down your hardware and data visualization software requirements, it’s time to start thinking about the full cost-benefit analysis. You should have already identified the commercial impact of your big data project during point one (above), but now you know the hardware specification and the chosen visualization option, so you can go back and sanity check that your original project rationale still makes sense with a full picture of the associated costs.
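The second step above, testing your algorithms on a subset of the data, can be sketched in a few lines. This is an illustrative prototype only: the dataset, the 1% sample size and the "average order value" question are all assumptions standing in for your own data and strategic questions.

```python
import random

# Illustrative sketch of prototyping on a subset before full roll-out.
# All names and figures below are hypothetical stand-ins.

random.seed(42)  # reproducible sample for the prototype run

# Stand-in for the full dataset: (customer_id, order_value) records.
full_dataset = [(i, round(random.uniform(5, 500), 2)) for i in range(100_000)]

# Work on a 1% random sample while validating the approach
# with the person who asked the strategic question.
sample = random.sample(full_dataset, k=len(full_dataset) // 100)

# Candidate "strategic question": what is the average order value?
avg_order_value = sum(value for _, value in sample) / len(sample)
print(f"Sample size: {len(sample)}, avg order value: {avg_order_value:.2f}")
```

If the sampled result is meaningful to the requester, you then have evidence for the cost-benefit step before committing hardware to the full dataset.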
Once you’ve done all the above, you can start your roll-out and data migration. This is by no means the end of your big data project. You’ll need to continue to ensure that the system you’ve chosen reports in an effective manner and that it generates answers to those strategic questions that need answering – perhaps even in real-time.
It’s important to continue to monitor your big data environment as time goes on to ensure that it’s delivering the value that the business needs – it’s very easy to continue to throw money at a big data platform that is not delivering the value you need.