Big Data Series: Data models & data types

Before we look at ways of modelling data, we first need an appreciation of the types of data we have: structured, semi-structured and unstructured.

Structured data has the same attributes (fields) in every record, much like the rows of an Excel spreadsheet. It follows a repeatable and predictable pattern throughout the file.
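To make the "same attributes in every record" idea concrete, here is a minimal sketch that checks whether each row of a small CSV file has the same number of fields as its header. The sample data and the function name has_uniform_fields are made up for illustration.

```python
import csv
import io

# Hypothetical sample data: every row carries the same three attributes.
SAMPLE_CSV = """id,name,signup_date
1,Alice,2021-01-15
2,Bob,2021-02-03
3,Carol,2021-02-20
"""

def has_uniform_fields(text):
    """Return True when every data row has as many fields as the header row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return all(len(row) == len(header) for row in reader)

print(has_uniform_fields(SAMPLE_CSV))        # True
print(has_uniform_fields("a,b\n1,2\n3\n"))   # False: the last row is missing a field
```

A file that fails this check no longer has the predictable shape we expect of structured data.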

Within the structured data model, we can enforce constraints. These can include:

    • Value constraints
    • Uniqueness constraints – Primary Keys
    • Cardinality constraints – lower and upper bounds on how many values (or related records) an attribute may have
    • Type constraints – data types (e.g. text or date)
    • Domain constraints – a list of acceptable values (e.g. months of the year)
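As a rough illustration of how some of these constraints look in practice, the sketch below uses Python's built-in sqlite3 module. The employee table, its columns and the try_insert helper are assumptions invented for this example. Cardinality constraints are left out because they usually involve more than one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: each column shows one kind of constraint from the list above.
conn.execute("""
    CREATE TABLE employee (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        age         INTEGER CHECK (typeof(age) = 'integer' AND age BETWEEN 16 AND 100),
        start_month TEXT CHECK (start_month IN
            ('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'))
    )
""")
# id          -> uniqueness constraint (primary key)
# name        -> value constraint (must not be empty/NULL)
# age         -> type constraint (must be an integer) plus a value range
# start_month -> domain constraint (must be one of the listed months)

def try_insert(row):
    """Attempt an insert; return None on success or the error text on rejection."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO employee (id, name, age, start_month) VALUES (?, ?, ?, ?)",
                row,
            )
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)

print(try_insert((1, "Alice", 30, "Jan")))     # accepted
print(try_insert((1, "Bob", 41, "Feb")))       # rejected: duplicate primary key
print(try_insert((2, "Carol", 7, "Mar")))      # rejected: age outside the allowed range
print(try_insert((3, "Dan", 35, "Smarch")))    # rejected: month not in the allowed list
```

The database refuses any row that breaks a rule, so the constraints are enforced by the data model itself and not by each application that writes to it.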

The schema of structured data will include a number of things:

    • The table name
    • Attribute (column) names
    • Allowed data type for each attribute
    • Constraints
    • Primary keys
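To see these schema elements in one place, the following sketch defines a small table with Python's built-in sqlite3 module and reads its schema back using SQLite's PRAGMA table_info. The customer table and the describe_table helper are hypothetical names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: table name, column names, data types, a constraint and a primary key.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        full_name   TEXT NOT NULL,
        joined_on   TEXT
    )
""")

def describe_table(connection, table):
    """Return (column name, declared type, not-null flag, primary-key flag) per column."""
    # The table name is interpolated directly, so only pass trusted names here.
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row from PRAGMA table_info is: cid, name, type, notnull, dflt_value, pk
    return [
        (name, col_type, bool(notnull), bool(pk))
        for _, name, col_type, notnull, _, pk in rows
    ]

for column in describe_table(conn, "customer"):
    print(column)
```

Each printed tuple corresponds to one attribute in the schema, with its allowed type and whether it is part of the primary key.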

Semi-structured data is more flexible than structured data. Examples are XML, HTML and JSON. In XML, for example, we have tags such as <id>. These tags can differ from one XML document to the next, and even within the tree structure the sub-tags can differ: two records under the same parent <user> tag can each contain a different set of fields.
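As a small sketch of this flexibility, the snippet below parses two <user> records with Python's standard xml.etree.ElementTree module and lists the tags found under each. The field names (name, email, address, city) and values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical records: same parent tag, different child fields.
USER_A = "<user><id>1</id><name>Alice</name><email>alice@example.com</email></user>"
USER_B = "<user><id>2</id><name>Bob</name><address><city>Leeds</city></address></user>"

def child_tags(xml_text):
    """Return the tag names found directly under the root element."""
    root = ET.fromstring(xml_text)
    return [child.tag for child in root]

print(child_tags(USER_A))  # ['id', 'name', 'email']
print(child_tags(USER_B))  # ['id', 'name', 'address']
```

Both documents are valid XML, yet no fixed list of columns describes them both, which is what separates semi-structured data from structured data.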

Unstructured data does not have a pre-defined model. It can be text-heavy (such as the body of a novel) or media such as JPG and PNG images, MP3 audio and video files. We discuss text analysis in the next article. More examples of unstructured data are:

  • Images / photos
  • Audio and video (WAV, MP4)
  • Social media data
  • Text-heavy data
  • Ebooks
  • Web pages