Big Data Crash Course: Structured, Semi-Structured and Unstructured Data

Structured data has the same attributes (fields) on each line, much like an Excel spreadsheet. It follows a repeatable and predictable pattern throughout the file.

Let’s think about structured data in relation to our lives at work. Do you have a database storing all your customers' details and each of the purchases that they make? This is structured data. Every customer record will have the same attributes (first name, surname, address, etc.).

Within any structured data model, we can enforce constraints to ensure that the data conforms to the structure we require.

One such constraint is a value constraint. A good example of this is an age field: we may set a minimum age to prevent younger individuals from registering, and we may also set an upper age to screen out clearly anomalous data. When we have both a lower and an upper limit, we refer to this as a range constraint. (A cardinality constraint is something different: it limits how many values or related records are allowed, not how large a value may be.)
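As a rough illustration, a lower and upper limit on age can be checked in a few lines of Python. The limits of 18 and 120 and the name `validate_age` are assumptions made for this sketch, not values taken from any particular system.

```python
MIN_AGE = 18   # assumed minimum age for registration
MAX_AGE = 120  # assumed upper limit to catch clearly anomalous entries


def validate_age(age: int) -> int:
    """Return the age unchanged if it is within the limits, else raise."""
    if not MIN_AGE <= age <= MAX_AGE:
        raise ValueError(f"age {age} is outside the range {MIN_AGE}-{MAX_AGE}")
    return age
```

Calling `validate_age(30)` hands back 30, whereas `validate_age(12)` or `validate_age(400)` raises a `ValueError`.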

We can also implement uniqueness constraints, most commonly through a primary key. No two records can have the same customer ID number, for example.
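To make the idea concrete, here is a minimal sketch of a uniqueness check using a plain Python dictionary keyed by customer ID. The `CustomerTable` class is hypothetical and exists only for illustration; a real database enforces this for us.

```python
class CustomerTable:
    """Hypothetical in-memory table keyed by customer ID."""

    def __init__(self):
        self._rows = {}

    def insert(self, customer_id, record):
        # Enforce uniqueness: reject an ID that is already present.
        if customer_id in self._rows:
            raise KeyError(f"customer ID {customer_id} already exists")
        self._rows[customer_id] = record

    def __len__(self):
        return len(self._rows)
```

Inserting two records with the same `customer_id` raises a `KeyError` on the second attempt, so the first record is never silently overwritten.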

Next, we have what we call domain constraints. This is where we set the allowable values which may be entered into a field. For example, for month, we would add the domain constraint of Jan, Feb, Mar and so on.
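A domain constraint boils down to a membership test against a fixed set of values. The sketch below assumes three-letter month abbreviations as the allowed domain; the names `ALLOWED_MONTHS` and `validate_month` are illustrative.

```python
# Assumed domain: three-letter month abbreviations.
ALLOWED_MONTHS = frozenset(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
)


def validate_month(value: str) -> str:
    """Return the value unchanged if it belongs to the domain, else raise."""
    if value not in ALLOWED_MONTHS:
        raise ValueError(f"{value!r} is not an allowed month")
    return value
```

With this domain, `validate_month("Feb")` passes, while `validate_month("Febuary")` is rejected.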

The final type of constraint that we will discuss here is a type constraint. That is where we force the data entered to be of a specific data type, for example in a date format.
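For a date field, one way to sketch a type constraint is to insist that the raw input parses as a date before we store it. The year-month-day format and the function name `parse_date_field` are assumptions for this example.

```python
from datetime import date, datetime


def parse_date_field(raw: str) -> date:
    """Parse a year-month-day string into a date; raise ValueError otherwise."""
    return datetime.strptime(raw, "%Y-%m-%d").date()
```

Anything that does not parse, such as free text typed into the field, raises a `ValueError` instead of being stored as a string.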

The schema of structured data will include a number of things:

  • The table name
  • Attribute (column) names
  • Allowed data type for each attribute
  • Constraints
  • Primary keys
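The items above can be seen together in a small schema. The sketch below uses Python's built-in `sqlite3` module; the `customer` table, its columns and the specific limits are all invented for illustration. Note that SQLite is relatively lenient about declared column types, so other databases will enforce the type part more strictly.

```python
import sqlite3

# Hypothetical schema: table name, column names, declared types,
# constraints and a primary key, all in one statement.
SCHEMA = """
CREATE TABLE customer (
    customer_id  INTEGER PRIMARY KEY,
    first_name   TEXT NOT NULL,
    surname      TEXT NOT NULL,
    age          INTEGER CHECK (age BETWEEN 18 AND 120),
    signup_month TEXT CHECK (signup_month IN (
        'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
        'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'))
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute("INSERT INTO customer VALUES (1, 'Ada', 'Lovelace', 36, 'Dec')")


def try_insert(row):
    """Return True if the row satisfies the schema, False if it is rejected."""
    try:
        conn.execute("INSERT INTO customer VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False
```

Here `try_insert` returns `False` for a repeated `customer_id`, an age outside the limits, or a month that is not in the list, because the database itself rejects the row.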

Semi-structured data is more flexible than structured data. Examples are HTML, JSON and XML. In XML, for example, we have tags, e.g. <id>. These tags can differ from one XML document to the next, and we can even see differing tags within the tree structure of the same XML document.

Consider two records that sit under the same parent <user> tag but carry different fields. This makes the data somewhat messy, so we require a solution that facilitates the use of a ‘flexible schema’ to accommodate semi-structured data.
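To illustrate, the sketch below parses two made-up <user> records with Python's standard `xml.etree.ElementTree` module. The records, their field names and the helper `user_to_dict` are invented for this example. Each record becomes a dictionary holding only the fields it actually has, which is one simple way of working with a flexible schema.

```python
import xml.etree.ElementTree as ET

# Two hypothetical <user> records with different child tags.
DOC = """
<users>
  <user><id>1</id><name>Ada</name><email>ada@example.com</email></user>
  <user><id>2</id><name>Alan</name><phone>555-0100</phone><city>London</city></user>
</users>
""".strip()


def user_to_dict(user):
    """Map each child tag of a <user> element to its text content."""
    return {child.tag: child.text for child in user}


root = ET.fromstring(DOC)
records = [user_to_dict(user) for user in root.findall("user")]

# The union of all field names seen across the records.
all_fields = sorted(set().union(*(record.keys() for record in records)))
```

The first record has an `email` field but no `phone`, the second has the reverse, and `all_fields` collects every field seen in either record. A fixed table schema would have to leave the missing columns empty; the dictionaries simply omit them.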

Unstructured data does not have a pre-defined model. This kind of information can be text heavy (such as the body of a novel), or it can take the form of images (JPG and PNG files, satellite images), audio (MP3 files), web pages and video.

User-generated data is usually the most unstructured. A great example of this is data generated through social media.