MapReduce is an algorithmic approach to dealing with Big Data. It was developed inside Google, which then published a paper sharing the nuances of the approach with the rest of the industry. The open source community coalesced around MapReduce, and Hadoop is the open source implementation of Google's MapReduce. We have seen how big data can be defined; now let's look at some big data technologies.
What is MapReduce?
MapReduce takes a large amount of data, divides it into several smaller batches, and processes each of those batches in parallel. It is a two-pass process: the first pass is called the map step and the second pass is called the reduce step. The map step concerns itself with splitting the data up and doing some pre-processing on each of the chunks. The reduce step then takes the output of the map step and aggregates the data. Specifically, the map step is responsible for outputting data in a key-value format. The reduce step expects the data in that format, sorted by key so that all data items for a given key are contiguous, and it produces output with only one piece of data per key; in that sense, it aggregates all the rows of data for a given key into a single row.
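The two-pass model described above can be sketched in a few lines of Python. This is a toy word count, not Hadoop itself; the function names and data are purely illustrative:

```python
from itertools import groupby
from operator import itemgetter

def map_step(lines):
    """Map: split each line into words, emit (word, 1) key-value pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_step(pairs):
    """Reduce: expects pairs sorted by key; aggregates counts per word."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data big", "data rules"]
sorted_pairs = sorted(map_step(lines))   # the shuffle/sort phase between passes
result = dict(reduce_step(sorted_pairs))
print(result)  # {'big': 2, 'data': 2, 'rules': 1}
```

In a real Hadoop job, the sort between the two steps (the "shuffle") is performed by the framework across the cluster, which is why the reduce step can rely on all values for a key arriving together.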
What is a Data Scientist?
Data scientists are people who are good at statistics and who also bring subject matter expertise, Hadoop experience, and programming skills in languages such as R. Individuals, or teams, who bring all of these skills together are what we call data scientists.
Big Data Technologies
What is Hadoop?
- Hadoop is an Apache project that combines a MapReduce engine with a distributed file system called HDFS, the Hadoop Distributed File System.
- This is an open source implementation of Google's MapReduce and the Google File System (GFS).
- Hadoop is typically used in combination with complementary tools and technologies that together make up a Hadoop stack.
Let's go through the elements of the Hadoop stack one by one.
Hadoop – It provides the MapReduce engine and file services to big data professionals.
Apache HBase (Database) – On top of Hadoop we might use one or another NoSQL database; the two most pertinent are HBase and Cassandra, both of which belong to the wide column/column family category. HBase can use HDFS to store its tables, and Cassandra can use the HDFS-compatible Cassandra File System to store its tables. As far as Hadoop is concerned, tables in these databases are just HDFS files, and because of that, data stored in them can be operated on using MapReduce jobs in Hadoop. However, the databases can be used without Hadoop, and Hadoop can be used without the databases.
Hive and Pig Latin (Query) – Hive and Pig both provide a query language abstraction over MapReduce and over the Java code required to create MapReduce jobs. With Hive you use a dialect of SQL called HiveQL, and with Pig you use a more stepwise data transformation language called Pig Latin. In either case, the queries you submit to these products eventually get compiled down to Java MapReduce code, so users who want to do data analysis don't have to write the Java themselves.
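To sketch what that compilation amounts to, here is a hypothetical GROUP BY query expressed as map and reduce steps in plain Python. The table, column names, and data are invented for illustration; Hive's actual compiled jobs are Java, but the shape is the same:

```python
from itertools import groupby
from operator import itemgetter

# A HiveQL-style query such as
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept;
# is compiled into a map step (emit the grouping key)
# and a reduce step (aggregate per key).

employees = [("alice", "eng"), ("bob", "eng"), ("carol", "sales")]

mapped = [(dept, 1) for _, dept in employees]   # map: emit (dept, 1)
mapped.sort(key=itemgetter(0))                  # shuffle/sort by key
counts = {dept: sum(n for _, n in grp)          # reduce: count rows per dept
          for dept, grp in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'eng': 2, 'sales': 1}
```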
Sqoop (RDBMS Import/Export) – Often used in combination with Hadoop is Sqoop, whose name is a contraction of "SQL-to-Hadoop". Sqoop is really just an import/export framework for moving data between Hadoop and relational databases, typically data warehouse systems. There are Sqoop connectors for most of the major data warehouse platforms at this point.
Mahout (Machine Learning, Data Mining) – We can also do machine learning, predictive analytics, and data mining on top of Hadoop, using another Apache project called Mahout.
Flume (Log File Integration) – Finally, we have Flume, which is used to deal with streaming data and merge it into HDFS. Typically that streaming data consists of log files (the words "flume" and "log" have an obvious connection), but Flume can more generally be thought of as a way to move streaming data into HDFS.
What is Hive?
- Hive is a top-level Apache project.
- It provides a SQL-like abstraction over MapReduce.
- It has its own HDFS table file format.
- It can also work over HBase.
- It acts as a bridge to many Business Intelligence (BI) products which expect tabular data.
Other Hadoop Technologies
R Programming Language – R is an open source statistical programming language.
Lucene – Lucene is an open source technology for building a full-text search indexing platform. Lucene and Hadoop were both originated by Doug Cutting, and Lucene is used in Mahout. Lucene is also hosted as a service on Azure.
Massively Parallel Processing
In Massively Parallel Processing (MPP), some things are similar to MapReduce and some are not. Let's look at both.
Similar to MapReduce
- It splits work up into sub-queries.
- It distributes them over the nodes in a cluster.
Dissimilar to MapReduce
- It typically uses SQL, not imperative code.
- There is no reduce step.
- Returns a single result set.
- Packaged as a physical appliance.
- Doesn’t use direct attached storage.
MPP is often combined with column store technology, which stores data column-wise rather than row-wise and achieves high compression in memory.
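A minimal Python sketch of why that helps, using toy data: storing a table column-wise exposes long runs of repeated values, which simple run-length encoding compresses well (real column stores use more sophisticated schemes, but the principle is the same):

```python
from itertools import groupby

# Row-wise layout: one tuple per record.
rows = [("US", 2019), ("US", 2020), ("US", 2021), ("UK", 2021)]

# Column-wise layout: one list per column.
countries = [r[0] for r in rows]   # ['US', 'US', 'US', 'UK']
years     = [r[1] for r in rows]

# A column holds many repeated values, so run-length
# encoding collapses each run into a (value, count) pair.
def rle(column):
    return [(value, len(list(run))) for value, run in groupby(column)]

print(rle(countries))  # [('US', 3), ('UK', 1)]
```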
NoSQL Databases
- NoSQL databases excel at storing unstructured and semi-structured data.
- They sacrifice database consistency for availability and ease of partitioning/clustering.
- It is extremely popular and well-hyped.
- Four types: key-value stores, document stores, graph databases and wide column/column family stores.
- NoSQL databases are governed by the CAP theorem.
- The CAP theorem concerns three qualities: Consistency, Availability, and Partition tolerance. It says a distributed database can fully guarantee at most two of the three at the same time.
- Schema-free storage and availability.
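The first of the four categories above can be sketched with a toy in-memory store. This is purely illustrative; real key-value databases (Redis and Riak, for example) add persistence, replication, and partitioning on top of this basic interface:

```python
# A minimal key-value store sketch: opaque keys map to
# arbitrary values, with no schema imposed on the values.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", {"name": "Ada", "role": "analyst"})
print(store.get("user:42"))  # {'name': 'Ada', 'role': 'analyst'}
```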
NewSQL Databases
- Relational and consistent.
- Acknowledges the value and prevalence of SQL skill sets.
- Relational and SQL-driven, ACID-compliant.
- Employ scale-out architectures.
- Partition tolerance is technically lower, but in practical terms these systems are very resilient.
HDFS Challenges and Alternative Storage
- The NameNode is a single point of failure, even with a warm standby.
- Direct attached storage (i.e. local hard drives) is hard to manage.
- Most enterprises have significant investments in networked storage and need to leverage that investment
- HDFS replication creates three instances of everything by default (i.e. original and two copies), which can balloon storage needs
- Swap out HDFS with an API-compatible, enterprise alternative
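The storage ballooning mentioned above is simple to quantify. A quick back-of-the-envelope sketch, with a hypothetical dataset size:

```python
# HDFS keeps three instances of each block by default
# (the original plus two replicas), so raw capacity
# requirements are triple the logical data size.
replication_factor = 3    # HDFS default
logical_data_tb = 10      # hypothetical dataset size in TB

physical_storage_tb = logical_data_tb * replication_factor
print(physical_storage_tb)  # 30
```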