Big Data Crash Course: What is big data?

While the term ‘big data’ is used a lot by small and large organizations alike, it doesn’t always mean that we’re speaking the same language or that we share the same understanding of the technology and its benefits. As such, the ideal starting point for this course is to discuss the concept in a little more detail, ensuring that we have a common understanding of the subject matter before we delve any further.

To quote SAS, “Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and society – as the Internet has become. Why? More data may lead to more accurate analyses.”

Unfortunately, there are no set parameters that will tell you whether your dataset should be considered ‘big data’. However, there are some factors we can apply to judge whether you do indeed have a big data project on your hands.

The first thing to note is that the definition of big data will be influenced by the capabilities of the organization that owns the dataset. For example, for a small company, with only Microsoft Excel at its disposal, 10GB of data could be an overwhelming amount to analyse and could take a long time to process. Conversely, global organizations, such as Amazon, will process thousands of gigabytes of data each hour – so to them, 10GB would not be considered a big data project at all.

Along the same lines, drawing information from a local database and visualizing that data to gain insight may be considered a big data project by a small organization. However, in larger companies, data will be drawn from multiple sources, which may include local databases, Microsoft Excel files, web services (such as Google Analytics) and plenty more. Once again, this comes down to your company’s capacity to implement big data solutions that draw upon multiple data sources.

Personally, I consider that if your data grows by 50% year on year and comes from a variety of sources, then you are well within your rights to call your project a big data project.
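To make that rule of thumb concrete, here is a minimal Python sketch. It is only an illustration: the function names are hypothetical, and it assumes that "grows by 50%" means growth of at least 50% and that "a variety of sources" means at least two distinct sources. Neither threshold is a standard, so adjust both to suit your organization.

```python
def year_on_year_growth(previous_gb: float, current_gb: float) -> float:
    """Return year-on-year growth as a fraction (0.5 means 50%)."""
    if previous_gb <= 0:
        raise ValueError("previous_gb must be positive")
    return (current_gb - previous_gb) / previous_gb


def looks_like_big_data(
    previous_gb: float,
    current_gb: float,
    sources: list[str],
    growth_threshold: float = 0.5,  # assumption: "50% year on year" read as "at least 50%"
    min_sources: int = 2,           # assumption: "a variety" read as two or more distinct sources
) -> bool:
    """Apply the rule of thumb: fast growth AND more than one data source."""
    growth = year_on_year_growth(previous_gb, current_gb)
    distinct_sources = len(set(sources))
    return growth >= growth_threshold and distinct_sources >= min_sources


if __name__ == "__main__":
    sources = ["local database", "Excel files", "Google Analytics"]
    # Data grew from 10GB to 15GB in a year, drawn from three sources.
    print(looks_like_big_data(10, 15, sources))
    # Data grew from 10GB to 12GB (20%), which falls short of the growth threshold.
    print(looks_like_big_data(10, 12, sources))
```

Both conditions must hold, which mirrors the "and" in the rule: a fast-growing dataset from a single source, or a slow-growing one from many sources, would not qualify under this reading.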