Big Data Series: What’s Hadoop good for & does Hadoop in the cloud make sense?

What is Hadoop good & not so good for?

Hadoop is good for a number of things:

  • For projects where we anticipate future data growth
  • For projects that require long-term availability of data that remains relatively easy to access
  • For high volume and variety of data

So, where is Hadoop not so strong / where should we consider different solutions?

  • When we have a small data set
  • When the workload calls for task parallelization *
  • When we need advanced algorithms that may not be reducible to a processing model supported by YARN
  • When we’re looking at replacements for our current data infrastructure – can we justify Hadoop? Will it cover all our requirements?

*As we mentioned before, data parallelization is where we run the same function across multiple nodes. Task parallelization is where we run many functions across many nodes at the same time.
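To make the distinction concrete, here is a minimal sketch in plain Python. It is not Hadoop code: threads on a single machine stand in for nodes, and the function names (word_count, longest_word, data_parallel, task_parallel) are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def longest_word(text: str) -> str:
    """Return the longest word in the text."""
    return max(text.split(), key=len)


def data_parallel(chunks: list[str]) -> list[int]:
    # Data parallelization: the SAME function runs on different
    # slices of the data, one slice per worker.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(word_count, chunks))


def task_parallel(text: str) -> dict:
    # Task parallelization: DIFFERENT functions run at the same time.
    with ThreadPoolExecutor() as pool:
        futures = {
            "word_count": pool.submit(word_count, text),
            "longest_word": pool.submit(longest_word, text),
            "char_count": pool.submit(len, text),
        }
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    print(data_parallel(["a b c", "d e"]))    # [3, 2]
    print(task_parallel("big data cluster"))
```

The first function has the shape that suits a map-style job: one operation, many data splits. The second has the shape the list above warns about, where many unrelated operations need to run side by side.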

Just because we have a ‘big data’ project, it doesn’t mean Hadoop is always the solution. Whatever we choose is going to require significant investment & hence, we should analyse carefully & proceed with caution.

Hadoop & the Cloud

Following on from the above discussion of careful analysis to avoid unnecessary costs, we now need to look at how we’re going to deploy our big data solution. We have two options: the cloud and on premises.

If you’re going to build it yourself, you’re going to need to assemble a team that can:

  • Rack & stack storage & compute power
  • Configure a low-latency network
  • Manage upgrades to the cluster & expansion
  • Estimate hardware requirements (notoriously hard)
  • Ensure software and patches are up to date
  • Invest significant CAPEX to make all this happen
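To show why estimating hardware is hard, here is a rough back-of-the-envelope sketch of storage sizing. The function name, the headroom figure and the example inputs are assumptions made for illustration; the only Hadoop-specific fact used is that HDFS replicates each block three times by default.

```python
import math


def estimate_data_nodes(
    current_tb: float,
    monthly_growth_tb: float,
    months: int,
    disk_tb_per_node: float,
    replication: int = 3,     # HDFS default replication factor
    headroom: float = 0.25,   # assumed share of disk kept free for temp data
) -> int:
    """Rough number of storage nodes needed after `months` of growth."""
    # Logical data volume at the end of the planning window.
    projected_tb = current_tb + monthly_growth_tb * months
    # Every block is stored `replication` times across the cluster.
    raw_tb = projected_tb * replication
    # Only part of each node's disk is available for HDFS blocks.
    usable_tb_per_node = disk_tb_per_node * (1 - headroom)
    return math.ceil(raw_tb / usable_tb_per_node)


if __name__ == "__main__":
    # 100 TB today, growing 5 TB a month, planned over 24 months,
    # on nodes with 48 TB of disk each.
    print(estimate_data_nodes(100, 5, 24, 48))  # 19
```

Even this simplified version depends on a growth rate and a headroom figure that are guesses, and it ignores compute, memory and network entirely.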

If you’re putting your cluster in the cloud (e.g. AWS), most of these problems go away. There is minimal up-front investment, quick deployment, and automated scaling and capacity management. Some cloud providers will also provide automated software updates.

This article may seem loaded towards the cloud solution & it’s true that it probably is. In most cases, the cloud does represent the best value for money, flexibility and scalability. There will, however, be situations in which the cloud is not the best option. For example, if your infrastructure is on premises & generates a huge amount of data, you may find data transfer latency & costs are too high to consider the cloud for your project.
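A quick calculation shows how fast transfer time adds up. This is a rough sketch only: the function name, the efficiency factor and the example figures are assumptions, and it uses decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second).

```python
def transfer_hours(data_tb: float, bandwidth_mbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to push `data_tb` terabytes over a link.

    `efficiency` is an assumed fraction of the nominal bandwidth that is
    actually achieved once protocol overhead and contention are included.
    """
    bits_to_send = data_tb * 1e12 * 8
    effective_bits_per_second = bandwidth_mbps * 1e6 * efficiency
    return bits_to_send / effective_bits_per_second / 3600


if __name__ == "__main__":
    # 10 TB of new data per day over a 1 Gbps link at 80% efficiency.
    print(round(transfer_hours(10, 1000), 1))  # 27.8
```

In this example, a day's worth of data takes longer than a day to upload, so the link can never catch up. That is before any per-GB transfer charges the provider may apply.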

Before we sign off from this ‘big data & the cloud’ section, let’s talk about the different types of cloud services:

  • IaaS (Infrastructure as a Service): is as it sounds. The provider supplies the underlying hardware (usually as virtual machines, storage & networking) and you are responsible for everything that runs on it, from the operating system up.
  • PaaS (Platform as a Service): is where the cloud provider will provide you with a ready-made environment including an OS, database server & other options to enable you to develop your application immediately.
  • SaaS (Software as a Service): is a fully managed software solution. All you need to do is use the software; the hardware, operating system and updates are all handled by the provider.

We are now moving into a world where we have XaaS, which denotes ‘Anything as a Service’.