While the term ‘big data’ is used widely by small and large organizations alike, that doesn’t always mean they have a firm grasp of the technology and its benefits. The ideal starting point for this post, then, is to discuss the concept in a little more detail, ensuring that we share a common understanding of the subject matter before we delve any further into the detail.
To quote SAS (source), “Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and society – as the Internet has become. Why? More data may lead to more accurate analyses.”
Unfortunately, there are no set parameters that will tell you whether your dataset should be considered ‘big data’. However, there are some factors we can apply to conclude whether you do indeed have a big data project on your hands.
The first thing to note is that the definition of big data is influenced by the capabilities of the organization that owns the dataset. For example, for a small company with only Microsoft Excel at its disposal, 10GB of data could be an overwhelming amount to analyse and could take a long time to process. Conversely, a global organization such as Amazon will process thousands of gigabytes of data each hour – so to them, 10GB would not be considered a big data project at all.
Along the same lines, drawing information from a local database and visualizing that data to gain insight may be considered a big data project by a small organization. However, in larger companies, data will be drawn from multiple sources, which may include local databases, Microsoft Excel files, web services (such as Google Analytics) and plenty more. Once again, this comes down to your company’s capacity to implement big data solutions that draw upon multiple data sources.
Personally, I consider that if your data grows by 50% year on year and comes from a variety of sources, then you are well within your rights to consider yours a big data project.
Let’s get into the detail
Big data can be defined through the use of the 6 V’s. They are Volume, Variety, Velocity, Veracity, Valence and Value.
- Volume: the size of the data. The greater the volume, the greater our challenges around data accessibility, storage and querying.
- Variety: the different types of data we can store. For example, images, audio, text, video. We can ask ‘how heterogeneous is our data’.
- Velocity: the speed at which data is generated.
- Veracity: the uncertainty of data (accuracy / reliability). The more unstructured data we bring into our environment, the lower the quality is likely to be.
- Valence: the connectedness of data. That is, the fraction of data that is actually connected to other items vs the total possible connections.
- Value: the business benefit derived from the data.
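Of the six V’s, valence is the easiest to make concrete: for n items there are at most n(n-1)/2 pairwise connections, and valence is the fraction of those that actually exist. A minimal sketch (my own illustration, not from the definition above):

```python
# Illustrative sketch: valence as the fraction of actual connections
# between data items vs. the total possible pairwise connections.

def valence(num_items: int, num_connections: int) -> float:
    """Valence = actual connections / possible pairwise connections."""
    possible = num_items * (num_items - 1) // 2
    return num_connections / possible if possible else 0.0

# 5 items allow at most 10 pairwise connections; 4 actual links -> 0.4
print(valence(5, 4))
```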
Data semantics can be another challenge. In the same dataset, we may have weight measured in lbs and kilograms. We may have age as a number and also as a grouping (e.g. child, adolescent, adult).
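To show what resolving those semantic differences might look like in practice, here is a small sketch. The conversion factor is standard, but the field names and age-group boundaries are illustrative assumptions, not a fixed standard:

```python
# Hypothetical harmonization helpers: normalize weights recorded in lbs
# or kg to a single unit, and map numeric ages onto the grouping used
# elsewhere in the dataset. Boundaries below are assumed, not standard.

LBS_PER_KG = 2.20462

def weight_to_kg(value: float, unit: str) -> float:
    """Normalize a weight reading to kilograms."""
    return value / LBS_PER_KG if unit == "lbs" else value

def age_group(age: int) -> str:
    """Map a numeric age onto a categorical grouping."""
    if age < 13:
        return "child"
    elif age < 18:
        return "adolescent"
    return "adult"

print(round(weight_to_kg(220.462, "lbs"), 1))  # 100.0
print(age_group(15))                           # adolescent
```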
So, if we were to assign one word to each of the V’s, we would assign:
- Variety: Complexity
- Volume: Size
- Velocity: Speed
- Valence: Connectedness
- Veracity: Quality
- Value: Benefit
So, what about data science?
Data science is the process of extracting insight from your data through statistical methods. To implement a data science strategy, we need to follow the steps below:
- Define your short and long term business objectives. That is, what are you trying to derive from the data?
- Attain organizational buy-in & build some excitement around your proposed big data benefits.
- Build a team of data scientists with diverse expertise.
- Implement a continuous training and development strategy for your team.
- Open a mini-lab whereby a portion of your data science team works on R&D projects which could benefit the wider business once proven.
- Share data widely: remove barriers to data & remove silos from within the business (while retaining data access restrictions required by law / regulation).
- Define policies (privacy, data lifetime, regulation, quality).
To support the generation of our data science strategy, we can use the 5 P’s:
- People: data science team and business stakeholders.
- Purpose: the challenges that your organization seeks to solve with big data.
- Process: how the data science team will work around a problem.
- Platforms: the scalable platforms your company will adopt (e.g. Hadoop).
- Programmability: the ability for the output to be integrated with visualization tools and other systems and to be queried utilizing programming languages (e.g. Python and R).
Now, let’s say that we have our data science strategy & business architecture in place. It’s now a case of getting into the data and realizing value. For that, we follow the process: Acquire > Prepare > Analyse > Report > Act. Each of these is discussed in detail below.
- The Acquire stage of the data science process is all about identifying the datasets that you need, retrieving those datasets, querying the data source and moving the data (e.g. moving data from a REST API into your local environment). We can find data in many locations:
- Web services (REST, SOAP, Web Socket), including JSON, XML, HTML5 and RSS.
- Text files, including CSV, Excel, etc…
- Structured data, including all your standard relational databases.
We can then load that data into suitable big data solutions. Unstructured data suits stores such as HDFS, Cassandra, MongoDB and HBase, while structured data suits a traditional RDBMS such as MySQL, SQL Server or Oracle.
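As a rough illustration of the Acquire stage, the sketch below pulls records from two hypothetical sources into one collection: a JSON payload of the kind a REST API might return, and a CSV export. The field names and values are made up for the example:

```python
import csv
import io
import json

# Illustrative Acquire step: combine records from a (mock) REST API
# JSON response and a CSV export into a single list of dicts.
api_payload = '[{"customer_id": 1, "country": "UK"}]'
csv_export = "customer_id,country\n2,US\n"

records = json.loads(api_payload)
# Note: csv.DictReader yields all values as strings.
records.extend(csv.DictReader(io.StringIO(csv_export)))

print(records)
```

In a real project the JSON would come from an HTTP call and the CSV from disk, but the shape of the work – query the source, move the data, merge it locally – is the same.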
- Prepare is all about, well, preparing the data. It’s where we explore the data further to see exactly what we have and identify any obvious correlations, general trends or outliers. We may choose to summarize the data with the mean, median, mode, range and standard deviation of the dataset. This stage also includes ‘pre-processing’ (also known as munging or wrangling): filtering and cleaning the data (removing duplicates and inconsistencies, fixing missing data and identifying invalid data), integrating multiple data sources, removing or combining fields and aggregating data. To reduce the scale of the data, we may also transform all values so that they sit between 0 and 1. For example, one person’s weight could be 40KG while another’s is 200KG; we may choose to represent these as 0.2 and 1.
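The summary statistics and 0-to-1 scaling described above can be sketched in a few lines using Python’s standard library (the weight values are illustrative, scaled by the maximum so 40KG becomes 0.2 and 200KG becomes 1):

```python
import statistics

# Illustrative Prepare step: summarize a field, then scale it to 0-1.
weights_kg = [40, 80, 120, 200]

print(statistics.mean(weights_kg))    # 110
print(statistics.median(weights_kg))  # 100.0
print(statistics.stdev(weights_kg))

# Scale by the maximum so 40KG -> 0.2 and 200KG -> 1.0
scaled = [w / max(weights_kg) for w in weights_kg]
print(scaled)  # [0.2, 0.4, 0.6, 1.0]
```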
- Next, we head into the Analyse stage where we’re going to select our analytical techniques and start to build models. There are a few high level types of analysis we can use:
- Classification: predicts a particular category. For example, using data we may be able to predict that a tumour will be benign.
- Regression: predicts a numeric value, for example, the price of a stock or predicted first-week sales of a new product.
- Clustering: organizes similar items into groups. For example, we may group customers as ‘children’, ‘teens’ and ‘adults’.
- Association: helps us to understand unlikely associations between items. For example, a supermarket in America found that late-night Sunday shopping trips for diapers were also a peak time for beer. The two products were frequently bought together by dads, so the store was able to place them next to one another – and saw excellent sales.
- Graph analysis: represents data as nodes with edges (lines) between them. It can help us to understand the spread of disease, the severity of threats and other insights.
- We then report (communicate) our findings to the business. This can include visualizations, reports and data summaries, focusing first and foremost on the business value that has been delivered.
- We then must act on our results to achieve business value.