Tuesday 18 November 2014

A whole new world of Big Data: Introduction to Hadoop

Overview: 
1. Journey from traditional databases to big data.
2. Introduction.
3. Introduction to the big data storage and processing framework: Hadoop.
4. The master-slave architecture.
5. HDFS (Hadoop Distributed File System): how a file is stored.
6. MapReduce: how data is retrieved.


Let's Begin:
Journey from traditional databases to big data

Dredging up memories of the mid-1970s, when memory space was a major concern for an enterprise, RDBMS (relational database management systems) dominated the world of databases. Relational databases were structured and normalized, and they sufficed for business needs by providing good analytics on transactional data. Gradually business grew, data sizes increased rapidly, and businesses needed sharper and more precise insights not only on their transactional data but also on their historical data. Relational databases remained the best solution for analysing transactional data, but they failed to deliver analysis on larger datasets that included historical data while keeping up their performance.

The term data warehouse was coined in the mid-1980s, along with data models that focus largely on the multidimensional concept. Data warehouse models were built to answer business questions; though structured, they were not normalized like a relational database, yet they provided good analysis on historical data, helping business take the right decisions based on a holistic dataset. The simplicity and good performance of OLAP (online analytical processing) systems skyrocketed their demand, and today almost every enterprise has one implemented next to its RDBMS. Since then, the combination of RDBMS, popularly known as OLTP (online transaction processing), and OLAP systems served business whims pretty well, until the explosion of data from social media, sensors, GPS devices, mobile phones, smart cards and so on prompted businesses to look for a more sophisticated solution that could handle more data without failing on performance. Hence the whole new era of big data came into existence.

Big data, as defined by Wikipedia, is the term for a collection of datasets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.

In a very short time the term Big Data has spread across the global market and become a hot topic in almost every enterprise-level discussion. Many multi-billion-dollar companies like Amazon, Facebook and Google have already implemented big data analytics, and those who have not have included it in their future road maps.


Introduction:

The three V's, Volume, Velocity and Variety, are considered the three major aspects of big data. Machine-generated data is produced at a much higher density than data from any ERP or CRM system; for instance, a single jet engine can generate 5 GB of data in one second, so just imagine how much data 20,000 flights per day would produce. Social media sites are currently viral over the internet, and the velocity at which their data comes in is pretty hard to structure; for instance, on average about 5,700 tweets are fired per second from a single social media site. The variety of data, such as audio, video, blogs, photos, web and mobile application data, adds to the complexity of database design. In 2009 the world's data was estimated at about 1 ZB, and by 2020 it is estimated to reach 35 ZB. MGI estimates that enterprises globally stored more than 7 exabytes of new data on disk drives in 2010, while consumers stored more than 6 exabytes of new data on devices such as PCs and notebooks.

Traditional database systems are simply not capable of processing such large volumes of unstructured data; hence there is an absolute need for database systems that can process these chunks of data. Big data's demand is burgeoning in the global economy.

Data can be classified into structured and unstructured data. Structured data is data for which we have defined data models, and business analytics are mostly performed on this chunk of data, whereas unstructured data is the chunk that lies scattered with no defined data model. Computerworld states that unstructured information might account for more than 70-80% of all data in an organization.


Introduction to the big data storage and processing framework: Hadoop

Big data can be structured, semi-structured or even unstructured. Now the question arises: how do we store it? If not OLTP and OLAP systems, then what? How do we stash the large volume of varying data coming in at high velocity, and then how do we retrieve it when needed without failing on performance?

Well, the solution to these questions has existed among us for more than a decade, but it has got much attention only lately, because what used to be a petty problem faced by very few organizations became much more cosmopolitan with the introduction of social media and cloud computing. When we think of big data, the first thing that crosses our mind is the big elephant, Hadoop. Hadoop, founded by Doug Cutting, has its origin in Apache Nutch, an open-source web search engine started in 2002. In 2004 Apache Nutch introduced NDFS (Nutch Distributed File System), based on GFS (the Google File System), which could scale to billions of pages on the web. In 2004 Google introduced MapReduce to the world, and by mid-2005 all of the major Nutch algorithms had been ported to run on MapReduce and NDFS. The implementation of NDFS and MapReduce in Nutch showed extraordinary results in searching larger chunks of data with good performance, and hence in February 2006 this part of Nutch was moved out to form an independent subproject called Hadoop. In 2009 Hadoop broke a world record to become the fastest system to sort one terabyte of data; the time taken was 62 seconds.

Hadoop, like any of us, has a heart and a brain: the brain being HDFS (Hadoop Distributed File System) and the heart being the MapReduce job. The whole world of Hadoop revolves around these two concepts, and they are what make Hadoop unique and fault tolerant. Hadoop, in simple words, is a framework that supports running applications on large clusters; it provides a distributed file system that supports parallel storage of data across several nodes, giving very high aggregate bandwidth across the cluster. With enterprises having large datacenters, memory space today is no longer a concern when performance is at stake; hence it is wise for any enterprise to keep copies of its data across the cluster to avoid data loss during node failure. A Hadoop cluster can have as many as 4,000 nodes spread across many server racks.



The Master-Slave Architecture:

HDFS is fault-tolerant and is designed to be deployed on low-cost hardware. In a typical HDFS write, a client splits the data file into different blocks, writes each block to a node and then replicates it to other nodes to avoid data loss during machine failure. In simple words, the client manages the reads and writes. And since no job can be perfectly accomplished without assistance, the client, to carry out the job, seeks help from several daemons present in the cluster: the name node, secondary name node, job tracker, task trackers and data nodes.

Hadoop in a nutshell is made up of:
1. HDFS
2. MapReduce

Both HDFS and MapReduce follow the master-slave architecture: HDFS consists of the name node, secondary name node and data nodes, while MapReduce consists of the job tracker and task trackers.

Name node: The name node plays the role of the master and holds the information (metadata) about all the blocks present in the data nodes. It maintains multiple copies of two files known as the fsimage and the edit log. The edit log persistently records every change that occurs to the file system metadata; for example, creating a new file in HDFS inserts a new record into the edit log indicating the change. Frequent changes in HDFS result in many records being inserted into the edit log, which in turn increases the size of the log, which again delays a name node restart. Hence, to avoid this situation, after some stipulated time the edit log is merged into the fsimage; this process is called checkpointing.
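To make the fsimage/edit log relationship concrete, here is a minimal toy sketch in Java. It is not Hadoop's actual implementation; the class and method names are made up purely for illustration. The namespace lives in a map (the "fsimage"), every mutation is first appended to an edit log, and a checkpoint replays the log into the image and truncates it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of name node metadata: an in-memory "fsimage" plus an edit log.
// Purely illustrative; Hadoop's real name node is far more involved.
public class ToyNameNode {
    private final Map<String, String> fsimage = new HashMap<>(); // path -> block list
    private final List<String[]> editLog = new ArrayList<>();    // pending metadata changes

    // Every namespace change is first recorded in the edit log.
    public void createFile(String path, String blocks) {
        editLog.add(new String[] {"CREATE", path, blocks});
    }

    public void deleteFile(String path) {
        editLog.add(new String[] {"DELETE", path, null});
    }

    // Checkpoint: replay the edit log into the fsimage and truncate the log,
    // so a restart only has to load a small log on top of the merged image.
    public void checkpoint() {
        for (String[] edit : editLog) {
            if ("CREATE".equals(edit[0])) {
                fsimage.put(edit[1], edit[2]);
            } else if ("DELETE".equals(edit[0])) {
                fsimage.remove(edit[1]);
            }
        }
        editLog.clear();
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.createFile("/logs/app.log", "blk_1,blk_2");
        nn.deleteFile("/tmp/old.dat");
        nn.checkpoint();
        System.out.println("Files tracked after checkpoint: " + nn.fsimage.keySet());
    }
}
```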

Secondary name node: Many people mistake the secondary name node for a backup name node, which is incorrect. The secondary name node is a standby node that keeps a snapshot of the latest fsimage and edit log of the name node; when the name node goes down and is later brought back up, its fsimage and edit log files are synced with those of the secondary name node. The secondary name node thus serves to keep the metadata information intact while the name node is down.

Data nodes: Data nodes play the role of slaves and store the actual data in the form of blocks, with the default block size being 64 MB.
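As a quick sanity check of the block arithmetic used later in this post, here is a tiny sketch (the class name and the file size are just examples):

```java
// Back-of-the-envelope helper: how many HDFS blocks does a file occupy?
// The 64 MB figure matches the Hadoop 1.x default block size.
public class BlockMath {
    static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB default block size
        long fileSize  = 1024L * 1024 * 1024; // a 1 GB file
        System.out.println("Blocks needed: " + blocksFor(fileSize, blockSize)); // prints 16
    }
}
```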

Job tracker: Just as the name node is the master in the HDFS architecture, the job tracker is the master node in the MapReduce architecture. The job tracker collects the metadata information from the name node and commands the task trackers to perform their respective tasks on their respective data nodes.

Task tracker: Task trackers are the slaves in the MapReduce architecture; they perform the map and reduce tasks on their respective data nodes and apprise the job tracker of the status of their tasks.
Each slave node runs a task tracker, and a task tracker has multiple slots, each of which can perform either a map or a reduce task. The default number of slots depends on the number of CPU cores. For example, on a machine with 8 cores we could run 1 map and 7 reduce tasks, 7 map and 1 reduce, or even 8 map tasks at the same time, as sketched below.
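In Hadoop 1.x the number of slots per task tracker is configurable. A hedged sketch of the relevant properties (the 7/1 split is only an example, and in a real cluster these values normally live in mapred-site.xml rather than being set in code):

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative only: splitting an 8-core machine into 7 map slots and 1 reduce slot.
public class SlotConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 7);    // map slots per task tracker
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 1); // reduce slots per task tracker
        System.out.println("Map slots: " + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2)
                + ", reduce slots: " + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
    }
}
```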

Rack awareness: A large HDFS instance runs on a cluster of machines spread across many racks. Communication between two nodes in different racks has to go through switches, and the network bandwidth between machines in the same rack is much greater than the network bandwidth between machines in different racks. Hence HDFS replicates a data block onto two machines in one rack and one machine in some remote rack, allowing fast storage and retrieval while maintaining a backup for times of rack failure.
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred to satisfy the read request.
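A minimal sketch of that placement idea is shown below. This is a simplification of HDFS's real block placement policy, and the rack and node names are made up for illustration: two replicas go to the writer's rack and the third to a different rack.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, illustrative replica placement: two replicas on the local rack,
// the third on a remote rack. HDFS's real policy handles many more cases.
public class ToyReplicaPlacement {
    static List<String> placeReplicas(Map<String, List<String>> racks, String localRack) {
        List<String> targets = new ArrayList<>();
        List<String> localNodes = racks.get(localRack);
        targets.add(localNodes.get(0)); // first replica on the writer's rack
        targets.add(localNodes.get(1)); // second replica on the same rack
        for (Map.Entry<String, List<String>> rack : racks.entrySet()) {
            if (!rack.getKey().equals(localRack)) {
                targets.add(rack.getValue().get(0)); // third replica on a remote rack
                break;
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new LinkedHashMap<>();
        racks.put("rack1", Arrays.asList("node1", "node2", "node3", "node4", "node5"));
        racks.put("rack2", Arrays.asList("node6", "node7", "node8", "node9", "node10"));
        System.out.println(placeReplicas(racks, "rack1")); // [node1, node2, node6]
    }
}
```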



HDFS (How a file is stored):

Now let us have a look at how a file is distributed around the cluster and how the data is retrieved from it. Let's assume we have a 3-server-rack architecture with 5 nodes (machines) per rack and a 1 GB file to be written into HDFS.

1. The client divides the 1 GB file into 16 blocks of 64 MB each (64 MB being the default block size).

2. The client then consults the name node about the data nodes into which each block should be written; in return the name node uses its rack awareness to determine the nodes. Ideally Hadoop suggests that each block be written to 3 nodes: 2 nodes in one rack and 1 node in another.

Consulting the figure below: for instance, let's suppose block 1 is to be written into node 1 of server rack 1 and nodes 6 and 9 of server rack 2. Before the client starts writing, it confirms whether the given nodes are ready to receive the block. The client picks the first data node (node 1), opens a connection and tells the node to get ready to receive the block; the client also sends the list of the other two nodes (node 6 and node 9) and asks node 1 to confirm their status too. Node 1 then tells node 6 to get ready and asks node 6 to confirm the status of node 9; similarly, node 6 tells node 9 to be ready and in return gets confirmation from node 9.



Figure 1.1: HDFS Architecture



After the connection is established between the nodes, an acknowledgment is sent back to the respective nodes confirming the readiness of their succeeding node. When all the other nodes signal the green light, the initial node (node 1) sends a ready message to the client, a replication pipeline is formed between the nodes, and the block is transferred to data node 1. The replication pipeline lets each node transfer data to the succeeding node and receive data from the preceding node at the same time, thus saving time during replication.

3. After the data dump completes, the client is apprised of the completion over the very same connection chain and the connection pool is closed. The client then sends the name node an update of the nodes into which the block was written, and the name node updates its metadata (the address of each block, with an index, against the nodes holding it). Note: all HDFS communication protocols are layered on top of TCP/IP.
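From an application's point of view all of this pipelining is hidden behind the HDFS client. Below is a minimal sketch of writing a file with the org.apache.hadoop.fs.FileSystem API; the name node address, file path and file contents are assumptions for illustration, and fs.default.name is the Hadoop 1-era property name.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a small file into HDFS. Block splitting, the replication
// pipeline and the name node bookkeeping all happen behind this client call.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed name node address; in practice this comes from core-site.xml.
        conf.set("fs.default.name", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/employees.txt"); // illustrative path

        try (FSDataOutputStream out = fs.create(path, true)) { // true = overwrite if it exists
            out.write("SAM,25000\nSAM,105000\nJOHN,40000\n".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Wrote " + fs.getFileStatus(path).getLen() + " bytes to " + path);
    }
}
```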




MapReduce (How data is retrieved):

Now that we have a clue of how a block of data is stored in the cluster, let us try to understand how the data in those blocks is retrieved. Retrieval and aggregation of the data is done by the MapReduce job.

1. Let's say a text file containing employee records is divided into 4 blocks, and we are interested in finding the employee named "SAM" and his total income. As usual, the client takes the initiative and submits the MapReduce job to the job tracker.

2. The job tracker then reaches out to the name node, inquiring about the nodes in which the blocks of the file are written. After the job tracker is apprised of the relevant nodes, it provides the task trackers (running locally on those nodes) with the code to perform the map task. Note: the information we are looking for can be stashed in one or many blocks depending on the data volume; hence it is imperative to traverse all the relevant blocks to get the resultant data.

3. The task tracker of each relevant node allocates a slot for a map task on its node. Multiple blocks can be read at the same time, depending on how many map tasks are running on the node simultaneously, which again depends on how many slots the node has. As each map task completes, the node stores the resultant data in its local storage, called the intermediate output; let's say node 1 stores "SAM" "$25,000" and node 11 stores "SAM" "$105,000".

After the map tasks are complete, the job tracker starts the reduce task on one of the data nodes (say node 6). The job tracker asks that node to fetch the intermediate data from each node and perform the computation. The node performing the reduce task collects the data from the other data nodes, stores the resultant output, i.e. "SAM" "$130,000", in a result file and writes the file into HDFS. The result file can then be read by the client machine. A hedged sketch of such a job is shown below.
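The sketch below shows what this kind of job could look like with Hadoop's Java MapReduce API. The class names, the comma-separated input format and the hard-coded "SAM" filter are assumptions for illustration, not the exact job described above; it simply filters records for SAM in the map phase and sums the incomes in the reduce phase.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: total the income of the employee "SAM" from lines like "SAM,25000".
public class SamIncome {

    public static class IncomeMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length == 2 && "SAM".equals(fields[0].trim())) {
                // Emit (name, income) only for the employee we are interested in.
                context.write(new Text("SAM"), new LongWritable(Long.parseLong(fields[1].trim())));
            }
        }
    }

    public static class IncomeReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get(); // e.g. 25000 + 105000
            }
            context.write(key, new LongWritable(total)); // SAM  130000
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sam income");
        job.setJarByClass(SamIncome.class);
        job.setMapperClass(IncomeMapper.class);
        job.setReducerClass(IncomeReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the employee file in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // result directory written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```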



Figure 1.2: MapReduce Task

