Hadoop has become synonymous with the term big data, but that doesn’t mean it meets the needs of every big data problem. The platform is very good at processing large quantities of unstructured and semi-structured data where time is not a constraint.
Hadoop is not an ideal solution for real-time analytics, because it is designed to run batch jobs that are not time critical. As such, it is a poor fit for many web-based environments where response time is crucial.
What is MongoDB?
MongoDB is a NoSQL document store, generally used as an operational real-time data store. It offers very good performance for workloads with high read/write loads.
Historically, a major pain point of database management was adding a column. In a traditional RDBMS such as MySQL, this operation can lock the table and cause performance issues. This is not the case with MongoDB, which does not enforce a fixed schema: adding a new field does not affect the existing documents in the database, so the new field is available immediately.
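To make the schema-less idea concrete, here is a minimal sketch in plain Python. It does not use MongoDB at all: ordinary dictionaries stand in for documents, and the field names (`name`, `email`) are assumptions chosen for illustration.

```python
# Plain dicts stand in for MongoDB documents; no database is involved.
collection = [
    {"_id": 1, "name": "Ada"},
    {"_id": 2, "name": "Grace"},
]

# A new document can carry a field that the older documents lack.
# Nothing about the existing documents has to be rewritten or migrated.
collection.append({"_id": 3, "name": "Alan", "email": "alan@example.com"})


def emails(docs):
    """Return each document's email, or None where the field is absent."""
    return [doc.get("email") for doc in docs]


print(emails(collection))
```

The trade-off is that readers of the data have to tolerate missing fields, which is why the lookup uses `get` rather than indexing directly.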
Additionally, partitioning and sharding are easier in MongoDB. Sharding is the process of distributing data across multiple machines.
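As a rough illustration of what sharding means, the sketch below routes each key to one of several machines by hashing it. This is a simplification for explanation only, not MongoDB's actual routing logic; the shard names and the `shard_for` and `distribute` helpers are hypothetical.

```python
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]  # hypothetical machine names


def shard_for(key, shards=SHARDS):
    """Pick a shard for a key by hashing it (same key, same shard)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def distribute(keys, shards=SHARDS):
    """Group keys by the shard each one would be stored on."""
    placement = {name: [] for name in shards}
    for key in keys:
        placement[shard_for(key, shards)].append(key)
    return placement


placement = distribute([f"user-{i}" for i in range(100)])
for name, keys in placement.items():
    print(name, len(keys))
```

Because the shard is derived from the key alone, any client can work out where a record lives without consulting the other machines.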
What is Hadoop?
Hadoop is a technology from the Apache Software Foundation that is designed to store and process large volumes of data distributed across multiple commodity servers.
Primarily, it is used for offline batch data processing and analysis, where time is not a constraint. Many queries can take minutes or hours to process.
When are MongoDB and Hadoop used in conjunction with one another?
In the event that complex data aggregation is necessary, Hadoop can be used in conjunction with MongoDB.
For example, an investment bank would pull time-critical data (such as ticker data, reference data and historical performance data) from a MongoDB database. For much heavier workloads that are not required in real time, such as risk modelling and fraud detection, the bank would use a Hadoop environment.
Another example would be a social network such as Facebook. It would store all data that is required in real time, such as chat messages, likes and user data, in MongoDB. It would then carry out detailed user analysis, segmentation and personalization as a batch job on a Hadoop cluster, as this particular task is not time sensitive.
The final example would be a mobile operator. When a user logs into their online account, the website will show them exactly what they’ve been spending their quota allowance on. For example, they may have used 1GB of data while browsing Facebook. This figure will have been computed by a batch job on the operator's Hadoop cluster, using web log data to ascertain the data usage for each request. The aggregated result can then be loaded into a MongoDB database for fast access when the user logs into their account.
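The mobile operator's batch job can be sketched as a map step followed by a reduce step. The code below is plain Python rather than a real Hadoop job, and the log layout (user, site, bytes), the sample records and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical web log records: (user_id, site, bytes_transferred)
LOG = [
    ("u1", "facebook.com", 600_000_000),
    ("u1", "facebook.com", 400_000_000),
    ("u1", "example.com", 50_000_000),
    ("u2", "facebook.com", 10_000_000),
]


def map_record(record):
    """Map step: emit a ((user, site), bytes) pair for one request."""
    user, site, nbytes = record
    return (user, site), nbytes


def reduce_usage(pairs):
    """Reduce step: sum the bytes for each (user, site) key."""
    totals = defaultdict(int)
    for key, nbytes in pairs:
        totals[key] += nbytes
    return dict(totals)


def to_documents(totals):
    """Shape the totals as one document per user, ready for a document store."""
    per_user = defaultdict(dict)
    for (user, site), nbytes in totals.items():
        per_user[user][site] = nbytes
    return [{"_id": user, "usage_bytes": usage}
            for user, usage in sorted(per_user.items())]


totals = reduce_usage(map_record(r) for r in LOG)
documents = to_documents(totals)
print(documents)
```

Each resulting document holds everything the account page needs for one user, so the website can fetch it with a single lookup by `_id` instead of scanning the logs.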
The way that the two are often integrated for optimal performance is:
- Data is pulled from MongoDB into the Hadoop cluster
- The data is processed by Hadoop and potentially fused with other data sources
- The output of this processing can be written back to MongoDB for fast querying on the front end and ad hoc analysis
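The three steps above can be sketched end to end in plain Python. In-memory lists and dictionaries stand in for MongoDB and for the Hadoop job, and the trade records, the ticker-to-sector reference data and the `extract`, `process` and `load` names are hypothetical.

```python
from collections import defaultdict


def extract(operational_store):
    """Step 1: copy the documents out of the operational store."""
    return [dict(doc) for doc in operational_store]


def process(docs, reference):
    """Step 2: fuse with a second data source and aggregate."""
    totals = defaultdict(float)
    for doc in docs:
        sector = reference.get(doc["ticker"], "unknown")
        totals[sector] += doc["quantity"] * doc["price"]
    return dict(totals)


def load(results, serving_store):
    """Step 3: write the results back, keyed for fast lookup."""
    for sector, exposure in results.items():
        serving_store[sector] = {"_id": sector, "exposure": exposure}
    return serving_store


# Stand-in for documents held in the operational database.
trades = [
    {"ticker": "AAA", "quantity": 10, "price": 2.5},
    {"ticker": "BBB", "quantity": 4, "price": 5.0},
    {"ticker": "AAA", "quantity": 2, "price": 2.5},
]
# Stand-in for the other data source that the batch job joins against.
sectors = {"AAA": "tech", "BBB": "energy"}

serving = load(process(extract(trades), sectors), {})
print(serving)
```

In a real deployment each function would be replaced by the corresponding connector or batch job; the point here is only the direction of data flow between the two systems.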
In summary, MongoDB is complementary to Hadoop rather than an alternative to it. Specific business requirements will determine whether the best option is MongoDB, Hadoop or a combination of the two.