Hadoop is an open source framework from the Apache Software Foundation for storing and processing large-scale data, usually called Big Data, which grows rapidly and cannot be processed easily by legacy databases. Hadoop processes large-scale data on clusters of commodity hardware. It was created by Doug Cutting, the creator of Apache Lucene, a text search library. Hadoop is written in Java.

Evolution of Hadoop

Electronic data is growing rapidly all over the world, and is now measured in terabytes (1 TB = 1000 GB) or petabytes (1 PB = 1000 TB). Most of this data is stored in databases distributed across the globe, and the rate of growth is accelerating. Some of the data is structured and some is unstructured, such as flat data sets. Examples of sources that generate huge volumes of data include social networking sites, blogs, databases, and many other kinds of websites. Organizations and industries use this data to analyze current statistics and forecast near-term business trends.

But extracting and analyzing vast amounts of structured or unstructured data requires a lot of computational power, which is beyond the scope of legacy databases and processing techniques. The massive explosion of data over the years led many organizations to replace their data servers with more powerful servers, but this could not solve the problem beyond a certain point of data growth.

That is where the evolution of Hadoop started, based on a scale-out approach for storing big data on large clusters of commodity hardware. Because Hadoop is designed to use commodity hardware in a scale-out approach instead of larger servers in a scale-up approach, data storage and maintenance became cheap and cost-effective compared to other storage mechanisms.

To process this big data distributed across clusters of commodity hardware, the MapReduce technique was introduced. It parallelizes the extraction and processing of structured and unstructured data across many nodes and hard drives in the cluster.
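
To make the MapReduce idea concrete, here is a minimal sketch in plain Python that counts words across several input splits. It does not use Hadoop or its Java API; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and a thread pool on one machine stands in for the nodes of a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all emitted values by key across every mapper's output."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(splits, workers=4):
    # Each split stands in for a block of data stored on a different node.
    # The map step runs on the splits in parallel (threads here, not machines).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, splits))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    splits = ["big data on hadoop", "hadoop stores big data", "data grows"]
    print(word_count(splits))
```

Run as a script, this prints {'big': 2, 'data': 3, 'on': 1, 'hadoop': 2, 'stores': 1, 'grows': 1}. The point of the split into map, shuffle, and reduce is that the map step needs no coordination between splits, which is what lets a real cluster run it on many nodes at once.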

Hadoop was created by Doug Cutting, the creator of Apache Lucene, a text search library. Hadoop is written in Java and has its origins in Apache Nutch, an open source web search engine. Because it is developed under the Apache Software Foundation, it is often called Apache Hadoop. It is an open source framework and is available for free download from Apache Hadoop Distributions.

Most Popular Hadoop Distributions

Many Hadoop distributions are currently available in the big data market, but the major free open source distribution is from the Apache Software Foundation. The other Hadoop distribution companies also provide free versions of Hadoop, along with customized distributions suited to the needs of client organizations. Using Apache Hadoop as the core framework, these companies build their own customized Hadoop cluster setups and services and provide commercial support to organizations working with big data. These are known as commercial Hadoop distributions. These vendors provide services such as managing updates, support, training, and consulting, and some add innovations of their own that make Hadoop more manageable for an enterprise.

In the free open source market, Red Hat makes money by taking the Linux kernel (the core of an open source operating system), bundling all the required components, building a simple installer, and providing paid support to customers.

In the same way, many companies provide enterprise editions and paid support on top of the Apache Hadoop distribution.

Free Open Source Hadoop Distribution

  • Apache Hadoop
    • Core Hadoop Distribution Used by all other distributions
    • Complex Cluster Setup but No Commercial Support
    • Manual Installation and Integration of Hadoop Eco System Components like Hive, HBase, Pig, etc.
    • Right choice for free trial / test demo purpose.

Other Popular Hadoop Distributions

  • Cloudera Hadoop
    • Hadoop’s co-founder, Doug Cutting, is its chief architect
    • Cloudera is regarded as the market leader in the Hadoop space, having released the first commercial Hadoop distribution
    • Highly active contributor of code to the Hadoop ecosystem
    • Provides Cloudera Distribution for Hadoop (CDH) parcels as well as Cloudera Manager, a powerful management and monitoring tool for Hadoop administration.
    • Its approach is to take components it deems to be mature and retrofit them into the existing production-ready open source libraries that are included in its distribution.
    • Formed in 2008 with its core distribution based on 100% open source Apache Hadoop.
    • CDH may be downloaded from Cloudera’s website at no charge for clusters of up to 50 data nodes, but without technical support or Cloudera Manager.
  • Hortonworks
    • Fast-growing company, started in 2011.
    • Another major player in the Hadoop market.
    • Originated from Yahoo and has the largest number of committers and code contributors for the Hadoop ecosystem components.
    • Releases Hortonworks Data Platform (HDP), which includes Hadoop as well as related tooling and projects
    • Hortonworks collaborates with major data management companies such as Teradata, Microsoft, Informatica, and SAS to provide Hadoop solutions integrated with their product sets.
    • Uses Apache Ambari for management, Stinger for queries, and Solr for searches.
  • Amazon Web Services Elastic MapReduce (AWS EMR) Hadoop
    • Hosted Hadoop framework running on the web-scale
      infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple
      Storage Service (Amazon S3).
    • Provides Management Software and GUI Support
    • Provides enhanced Data protection
  • MapR Hadoop
    • Provides complete distribution of Apache Hadoop and related projects that’s independent of the Apache Software Foundation.
    • MapR is being promoted as the only Hadoop distribution that provides full data protection, no single points of failure, and significant ease-of-use advantages.
    • It has replaced the underlying HDFS with its own proprietary file system, MapRFS, which is intended to improve data management efficiency, reliability, and ease of use.
    • Three MapR editions are available: M3, M5, and M7.
    • The M3 Edition is free and available for unlimited production use;
    • MapR M5 is an intermediate-level subscription software offering;
    • MapR M7 is a complete distribution for Apache Hadoop and HBase that includes Pig, Hive, Sqoop, and much more.
  • Pivotal Greenplum Hadoop
    • Integrates EMC’s massively parallel processing (MPP) database technology (formerly known as Greenplum, and now known as HAWQ) with Apache Hadoop
    • High-performance Hadoop distribution with true SQL processing for Hadoop.
    • SQL-based queries and other business intelligence tools can be used to analyze data that is stored in HDFS
  • Intel Hadoop
    • Provides excellent performance with optimizations for Intel Xeon processors, Intel SSD storage, and Intel 10GbE networking.
    • Provides data security via encryption and decryption in HDFS
    • Supports role-based access control with cell-level granularity in HBase.
    • Improved Hive query performance.
    • Support for statistical analysis with open source statistical package R, and analytical graphics through Intel Graph Builder.
  • IBM InfoSphere BigInsights
    • Focuses on adding value on top of the open source Hadoop stack
    • BigInsights comes with a built-in, browser-based spreadsheet tool called BigSheets
    • Strong support for adaptive real-time analytics and good text analytics capabilities using AQL and JAQL.
  • Microsoft Hadoop on Windows Azure
    • Microsoft HDInsight integrates Apache Hadoop and the Hortonworks Data Platform on Microsoft’s cloud platform, Windows Azure
    • Currently supports Pig, Hive, and Sqoop
  • DataStax Hadoop
    • DataStax Enterprise big data platform consists of open source tools Apache Hadoop, Cassandra, Solr, Hive, Pig, Mahout, etc.
    • DSE is designed to manage real-time, enterprise search data in the same database cluster.
    • It also comes with OpsCenter Enterprise, which allows DSE clusters to be managed through a central web interface.

Apart from these, there are many other Hadoop distributions, all of them built on Apache Hadoop, which is open source under the Apache License 2.0.

Comparison Chart

Below is a good comparison chart prepared by Robert Schneider in the Hadoop Buyer’s Guide (2014) for Cloudera, Hortonworks, and MapR, the three leading commercial Hadoop distributions.

[Image: Cloudera vs. Hortonworks vs. MapR comparison chart]

Looking at the comparison chart above, one may feel that MapR is the best of the three, but before concluding that, we need to consider a few of its characteristics.

  • MapR has its own proprietary file system (MapRFS). Switching from MapR to another Hadoop vendor would be painful for an organization, because MapRFS differs from the native HDFS used in other distributions.
  • Mutable keys: The MapR file system allows mutable keys while HDFS does not. Being able to change an established key (mutability) is risky: if keys are inadvertently changed, previously built applications may no longer be able to use the data. MapR argues that mutability has real advantages and that good data management practices eliminate this risk.

So an organization has to review the strengths and weaknesses of each vendor before choosing a Hadoop distribution for its big data business intelligence platform.