Big Data Crash Course: MongoDB

MongoDB is a document-oriented database and one of the leaders of the NoSQL revolution. To summarize, a NoSQL database is one that lets us consume data of varying structure (structured, semi-structured and unstructured), schema and data type. NoSQL databases enable us to respond quickly to user demand through scalability and flexibility.

So what about MongoDB? It has become something of a leader in the NoSQL space and has huge community support. It organizes data into databases, collections and documents, rather than the traditional database, table and row (tuple) framework we see in relational database systems.

The diagram below shows this nicely: a collection relates to a table and documents relate to rows. Remember, collections do not enforce a schema, so the documents they hold can each have different fields. However, all documents within a collection will generally be intended for a similar purpose.
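As a quick sketch of that flexibility, the snippet below (using Python's pymongo driver, with a hypothetical local connection, database and collection names) inserts two documents with different fields into the same collection:

```python
from pymongo import MongoClient

# Hypothetical connection string and names, for illustration only.
client = MongoClient('mongodb://localhost:27017')
people = client['crash_course']['people']

# Two documents with different fields can live in the same collection,
# since MongoDB does not enforce a schema.
people.insert_one({'name': 'Ada', 'email': 'ada@example.com'})
people.insert_one({'name': 'Grace', 'languages': ['COBOL', 'FORTRAN'], 'active': True})
```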

Documents are stored in MongoDB in a binary JSON (BSON) format as key-value pairs, which gives fast lookups on a particular key. We can further improve read performance by indexing on one or more attributes.
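For example, we can create a single-field index, or a compound index that covers queries filtering on several attributes at once. The field names below are illustrative:

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient('mongodb://localhost:27017')
people = client['crash_course']['people']

# Single-field index for fast lookups on one key.
people.create_index([('email', ASCENDING)])

# Compound index covering queries that filter on both fields.
people.create_index([('last_name', ASCENDING), ('first_name', ASCENDING)])

# This query can now be satisfied using the compound index.
people.find_one({'last_name': 'Hopper', 'first_name': 'Grace'})
```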

MongoDB supports JSON documents with nested structures and arrays. The same structure in an RDBMS would require multiple tables with foreign key relationships, so queries that would otherwise need joins can often be answered from a single document, which can make querying much faster.
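A minimal sketch of this embedding, assuming a hypothetical orders collection that folds customer and line-item data into one document:

```python
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
orders = client['crash_course']['orders']

# One document embeds what an RDBMS would split across
# customers, orders and order_items tables.
orders.insert_one({
    'order_id': 1001,
    'customer': {'name': 'Ada', 'city': 'London'},
    'items': [
        {'sku': 'A1', 'qty': 2},
        {'sku': 'B7', 'qty': 1},
    ],
})

# Dot notation reaches into nested structures and arrays, no join required.
orders.find_one({'customer.city': 'London', 'items.sku': 'A1'})
```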

So, why is Mongo leading the charge for NoSQL? First off, it provides very high performance because we can load balance across primary and replica nodes. All write operations go to the primary, while read operations can be directed at the replica nodes. This prevents a single server from being congested by an influx of concurrent requests. Should we still run into capacity issues, MongoDB is easy to scale out.
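As an illustration, pymongo lets us route reads to the replicas via a read preference. The hosts and replica set name below are placeholders:

```python
from pymongo import MongoClient, ReadPreference

# Hypothetical replica-set connection; hosts and set name are placeholders.
client = MongoClient(
    'mongodb://mongo1.example.com,mongo2.example.com,mongo3.example.com',
    replicaSet='rs0',
)

# Writes always go to the primary; this read preference routes
# reads to a secondary whenever one is available.
db = client.get_database('crash_course',
                         read_preference=ReadPreference.SECONDARY_PREFERRED)
db['people'].find_one({'name': 'Ada'})
```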

MongoDB also offers high availability through replication. As mentioned above, we have one writable primary and one or more replicas serving read requests. To set up high availability with MongoDB, we need a minimum of three servers for resilience: a primary, a secondary and an arbiter. The arbiter doesn't store any data; it votes in the election that decides which server becomes the new primary in the event of a failure.
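A minimal sketch of initiating such a three-member replica set through pymongo's admin command interface, with placeholder hostnames and set name:

```python
from pymongo import MongoClient

# Connect directly to the member that will seed the replica set
# (hostnames and set name are placeholders).
client = MongoClient('mongo1.example.com', 27017, directConnection=True)

client.admin.command('replSetInitiate', {
    '_id': 'rs0',
    'members': [
        {'_id': 0, 'host': 'mongo1.example.com:27017'},  # primary candidate
        {'_id': 1, 'host': 'mongo2.example.com:27017'},  # secondary (read replica)
        # The arbiter votes in elections but stores no data.
        {'_id': 2, 'host': 'mongo3.example.com:27017', 'arbiterOnly': True},
    ],
})
```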

Like HBase, MongoDB offers automatic partitioning, or sharding. Sharding is the process of breaking a large dataset into smaller, more manageable pieces and distributing them across servers (called shards). This lets us adopt a parallelized processing model.
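As a sketch, sharding a collection involves enabling sharding for the database and choosing a shard key, issued against a mongos query router. The host, database and key names below are assumptions:

```python
from pymongo import MongoClient

# Connect to a mongos query router (hypothetical host).
client = MongoClient('mongodb://mongos.example.com:27017')

# Enable sharding for the database, then shard a collection
# on a hashed key so data spreads evenly across the shards.
client.admin.command('enableSharding', 'crash_course')
client.admin.command('shardCollection', 'crash_course.events',
                     key={'user_id': 'hashed'})
```

A hashed shard key is used here so that inserts distribute evenly rather than piling onto one shard; a ranged key is the alternative when range queries matter more.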

Just like HBase, MongoDB can also handle structured, semi-structured and unstructured data.

Unlike HBase, which offers immediate consistency, MongoDB offers only eventual consistency when reads are spread across replica nodes, which makes it less suited to real-time read and write workloads.

In my opinion, MongoDB is well suited to massive analytics on structured or unstructured data that is not required in real time, while HBase is suited to large-scale analytics on datasets that must be written to and read from the database in real time.

An experiment carried out by DataStax (datastax.com/nosql-databases/benchmarks-cassandra-vs-mongodb-vs-hbase) showed that HBase achieved between 193% and 219% of MongoDB's total write operations with the same number of nodes in the cluster and the same dataset, reinforcing the notion that HBase is well suited to real-time operations.

Of course, different use-cases will produce different results, so it is still important to test each solution in your own environment and make an informed decision that works for you.

Find out about querying MongoDB here.