Before we look at ways of modelling data, we first need an appreciation of the types of data we have. These are known as structured, semi-structured and unstructured data.
Structured data has the same attributes (fields) in every record, much like the rows of an Excel spreadsheet. It follows a repeatable and predictable pattern throughout the file.
Within the structured data model, we can enforce constraints. These can include:
- Value constraints
- Uniqueness constraints – Primary Keys
- Cardinality constraints – minimum and maximum bounds within which the number of related records must fall
- Type constraints – data types (e.g. text or date)
- Domain constraints – a list of acceptable values (e.g. months of the year)
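As a rough illustration of how some of these constraints behave in practice, here is a minimal sketch using Python's built-in sqlite3 module. The `employee` table, its columns and the `try_insert` helper are all assumptions made up for this example; other databases express the same ideas with slightly different syntax. Cardinality constraints are left out, since they involve relationships between tables.

```python
import sqlite3

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

conn = sqlite3.connect(":memory:")
month_list = ", ".join(f"'{m}'" for m in MONTHS)
conn.execute(f"""
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY,                          -- uniqueness constraint
        name TEXT NOT NULL,
        age INTEGER CHECK (typeof(age) = 'integer'       -- type constraint
                           AND age BETWEEN 16 AND 100),  -- value constraint
        start_month TEXT CHECK (start_month IN ({month_list}))  -- domain constraint
    )
""")

def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO employee (id, name, age, start_month) VALUES (?, ?, ?, ?)",
            row,
        )
        return True
    except sqlite3.IntegrityError:
        return False

results = {
    "valid": try_insert((1, "Ada", 36, "Jan")),
    "duplicate_id": try_insert((1, "Bob", 40, "Feb")),      # breaks uniqueness
    "age_out_of_range": try_insert((2, "Cy", 7, "Feb")),    # breaks value range
    "age_wrong_type": try_insert((3, "Di", "abc", "Mar")),  # breaks type check
    "bad_month": try_insert((4, "Ed", 30, "Smarch")),       # breaks domain list
}
print(results)
```

Only the first row should be accepted; each of the others violates exactly one of the constraints declared on the table.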
The schema of structured data will include a number of things:
- The table name
- Attribute (column) names
- Allowed data type for each attribute
- Primary keys
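To make the parts of a schema concrete, the sketch below creates a small table with Python's built-in sqlite3 module and then reads its schema back. The `customer` table and its column names are hypothetical, chosen only for illustration. Note that SQLite has no dedicated date type, so the date column is declared as TEXT here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        full_name   TEXT,
        signup_date TEXT
    )
""")

# The table name is stored in SQLite's catalogue table, sqlite_master.
table_name = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchone()[0]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(customer)").fetchall()

schema = {
    "table": table_name,
    "columns": {name: col_type for _, name, col_type, _, _, _ in columns},
    "primary_key": [name for _, name, _, _, _, pk in columns if pk],
}
print(schema)
```

The resulting dictionary holds the same four items listed above: the table name, the attribute names, the declared type of each attribute, and the primary key.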
Semi-structured data is more flexible than structured data. Examples are XML, HTML and JSON. In XML, for example, we have tags, e.g. <id>. These tags can differ from one XML document to the next, and even the sub-tags within the tree structure can differ. Two documents that share the same parent <user> tag can therefore contain different fields.
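A minimal sketch of this flexibility, using Python's standard xml.etree.ElementTree module. The two <user> records and their field names are made up for illustration: both share the same parent tag, but the fields underneath it differ, and one of them nests a further sub-tag.

```python
import xml.etree.ElementTree as ET

# Two hypothetical <user> records with different fields under the same parent tag.
user_a = """
<user>
    <id>1</id>
    <name>Alice</name>
    <email>alice@example.com</email>
</user>
"""

user_b = """
<user>
    <id>2</id>
    <name>Bob</name>
    <phone>
        <mobile>555-0100</mobile>
    </phone>
</user>
"""

def child_tags(xml_text):
    """Return the tag names found directly under the root element."""
    root = ET.fromstring(xml_text)
    return [child.tag for child in root]

fields_a = child_tags(user_a)
fields_b = child_tags(user_b)

shared = set(fields_a) & set(fields_b)   # fields present in both records
only_a = set(fields_a) - set(fields_b)   # fields only the first record has
only_b = set(fields_b) - set(fields_a)   # fields only the second record has

print("shared:", sorted(shared))
print("only in first:", sorted(only_a))
print("only in second:", sorted(only_b))
```

Both records parse without complaint even though their fields differ, which is exactly what a structured table with a fixed set of columns would not allow.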
Unstructured data does not have a pre-defined model. It can be text-heavy (such as the body of a novel) or media such as JPG and PNG images, MP3 audio and video. We discuss text analysis in the next article. More examples of unstructured data are:
- Images / photos
- Audio and video (WAV, MP4)
- Social media data
- Text-heavy data
- Web pages