Big Data and Hadoop Interview Questions and Answers
What is Big Data?
Big Data refers to data sets so large and complex that they are difficult to capture, store, process, retrieve and analyze with traditional database management tools.
What are the three major characteristics of Big Data?
According to IBM, the three characteristics of Big Data are:
Volume: Facebook generating 500+ terabytes of data per day.
Velocity: Analyzing 2 million records each day to identify the reason for losses.
Variety: images, audio, video, sensor data, log files, etc.
What is Hadoop?
Hadoop is a framework that allows distributed processing of large data sets across clusters of commodity hardware (ordinary computers) using a simple programming model.
What is the basic difference between traditional RDBMS and Hadoop?
A traditional RDBMS is used in transactional systems to store and process data, whereas Hadoop is used to store and process large amounts of data in a distributed file system.
What are the basic components of Hadoop?
HDFS and MapReduce are the basic components of Hadoop.
HDFS is used to store large data sets and MapReduce is used to process such large data sets.
What is HDFS?
HDFS stands for Hadoop Distributed File System. It is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
What is MapReduce?
MapReduce is the Java-based programming model of the Hadoop framework. It lets processing scale across the machines of a Hadoop cluster.
How does MapReduce work in Hadoop?
MapReduce divides the workload into two different jobs: 1. the Map job and 2. the Reduce job. The Reduce job runs on the output of the Map job, and the tasks within each job can run in parallel.
1. The Map job breaks the data sets down into key-value pairs (tuples).
2. The Reduce job then takes the output of the Map job and combines those tuples into a smaller set of tuples.
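The two steps above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, group and reduce flow using word counting; it does not use the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as happens between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single (key, total) pair."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

In a real cluster each call to map_phase would run as a separate task on a different node, which is where the parallelism comes from.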
What is a Name node?
The name node is the master node of HDFS and holds the metadata. It maintains and manages the blocks that are present on the data nodes. On small clusters the job tracker may run on the same machine. The name node is the single point of failure in HDFS, so it should run on a high-availability machine.
What is a Data node?
Data nodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients.
What is a job tracker?
The job tracker is a daemon that runs on a master node (on small clusters, the same machine as the name node) and handles the submission and tracking of MapReduce jobs in Hadoop. It assigns tasks to the different task trackers. A Hadoop cluster has only one job tracker but many task trackers. If the job tracker goes down, all running jobs are halted.
How does the job tracker work?
When a client submits a job, the job tracker initializes the job, divides the work, and assigns it to different task trackers, which perform the MapReduce tasks.
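As a rough illustration of "divide the work and assign it", the sketch below hands out input splits to task trackers in round-robin order. This is an assumption made for simplicity: the real job tracker also takes factors such as data locality and free task slots into account, and the name assign_splits is hypothetical.

```python
def assign_splits(splits, task_trackers):
    """Distribute input splits over task trackers in round-robin order.

    Simplifying assumption: every tracker is equally suitable for every split.
    """
    if not task_trackers:
        raise ValueError("at least one task tracker is required")
    assignment = {tracker: [] for tracker in task_trackers}
    for index, split in enumerate(splits):
        tracker = task_trackers[index % len(task_trackers)]
        assignment[tracker].append(split)
    return assignment


if __name__ == "__main__":
    print(assign_splits(["split-0", "split-1", "split-2"], ["tt1", "tt2"]))
```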
What is a task tracker?
The task tracker is also a daemon, and it runs on the data nodes. Task trackers manage the execution of individual tasks on the slave nodes.
How does the task tracker work?
The task tracker is mainly responsible for executing the work assigned by the job tracker. While doing so, it continuously communicates with the job tracker by sending heartbeats.
What is a heartbeat?
Task trackers communicate with the job tracker by sending heartbeats, based on which the job tracker decides whether an assigned task has completed. If the job tracker does not receive a heartbeat from a task tracker within a specified time, it assumes that the task tracker has crashed and assigns its tasks to another task tracker in the cluster.
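The timeout logic described above can be sketched as follows. This is a simplified model, not Hadoop's implementation: the class and function names are hypothetical, timestamps are passed in explicitly so the example is deterministic, and the 10-second timeout in the usage lines is an arbitrary illustrative value.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time reported by each task tracker."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # tracker id -> time of last heartbeat

    def record_heartbeat(self, tracker_id, now):
        self.last_seen[tracker_id] = now

    def dead_trackers(self, now):
        """Return trackers whose last heartbeat is older than the timeout."""
        return sorted(
            tracker for tracker, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


def reassign_tasks(assignments, dead, live):
    """Move every task owned by a dead tracker to a live one, round-robin."""
    orphaned = sorted(t for t, tracker in assignments.items() if tracker in dead)
    if orphaned and not live:
        raise ValueError("no live task trackers to reassign to")
    result = dict(assignments)
    for index, task in enumerate(orphaned):
        result[task] = live[index % len(live)]
    return result


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=10)
    monitor.record_heartbeat("tt1", now=0)
    monitor.record_heartbeat("tt2", now=0)
    monitor.record_heartbeat("tt1", now=8)   # tt2 stays silent
    dead = monitor.dead_trackers(now=12)
    print(dead, reassign_tasks({"map-0": "tt1", "map-1": "tt2"}, dead, ["tt1"]))
```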
Is the Namenode machine the same as a datanode machine in terms of hardware?
It depends on the cluster you are trying to create. The Hadoop VM can be on the same machine or on another machine. For instance, in a single-node cluster there is only one machine, whereas in a development or testing environment the Namenode and the datanodes are on different machines.
What is commodity hardware?
Commodity hardware is inexpensive hardware that is not of especially high quality or high availability. Hadoop can be installed on any average commodity hardware. We don’t need supercomputers or high-end hardware to work with Hadoop.
Is the Namenode also commodity hardware?
No. The Namenode should not run on commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS, so the Namenode has to be a high-availability machine.
What is metadata?
Metadata is information about the data stored in the datanodes, such as the location of a file, the size of a file and so on.
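To make this concrete, here is a toy picture of the kind of bookkeeping involved: a file's size, the blocks it is made of, and the datanodes holding each block. The dictionary layout, the path, the block ids and the datanode names are all hypothetical and are not Hadoop's actual on-disk or in-memory format.

```python
MB = 1024 * 1024

# Hypothetical metadata: file path -> size and ordered list of block ids.
namespace = {
    "/logs/app.log": {"size_bytes": 150 * MB, "blocks": ["blk_1", "blk_2", "blk_3"]},
}

# Hypothetical metadata: block id -> datanodes that hold a copy of the block.
block_locations = {
    "blk_1": ["datanode-a", "datanode-b"],
    "blk_2": ["datanode-b", "datanode-c"],
    "blk_3": ["datanode-a", "datanode-c"],
}


def locate(path):
    """Return (block id, datanodes) pairs for a file, in block order."""
    return [(block, block_locations[block]) for block in namespace[path]["blocks"]]


if __name__ == "__main__":
    for block, nodes in locate("/logs/app.log"):
        print(block, nodes)
```

Note that only this lookup information is involved; the file contents themselves stay on the datanodes.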
What is a daemon?
A daemon is a process or service that runs in the background. The term is generally used in UNIX environments. The equivalent in Windows is “services” and in DOS is “TSR”.
Are the Namenode and the job tracker on the same host?
No, in a practical environment the Namenode and the job tracker run on separate hosts.
What is a ‘block’ in HDFS?
A ‘block’ is the minimum amount of data that can be read from or written to HDFS. The default block size is 64 MB (in Hadoop 1.x; later versions default to 128 MB).
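A small worked example of how a file maps onto blocks, assuming the 64 MB default. The helper name split_into_blocks is hypothetical. The last block holds only the remaining bytes, so a 150 MB file needs three blocks of 64 MB, 64 MB and 22 MB.

```python
MB = 1024 * 1024


def split_into_blocks(file_size_bytes, block_size_bytes=64 * MB):
    """Return the size of each block needed to store a file of the given size."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        sizes.append(remainder)  # the final block is only as large as the leftover data
    return sizes


if __name__ == "__main__":
    print([size // MB for size in split_into_blocks(150 * MB)])
```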
How is a full datanode identified?
When data is stored on a datanode, the metadata for that data is stored on the Namenode, so the Namenode can identify when a datanode is full.
If the number of datanodes increases, do we need to upgrade the Namenode?
While installing the Hadoop system, the Namenode is sized based on the size of the cluster. Most of the time we do not need to upgrade the Namenode, because it does not store the actual data, only the metadata, so such a requirement rarely arises.
On what basis will the Namenode decide which datanode to write to?
As the Namenode has the metadata (information) related to all the datanodes, it knows which datanodes have free space.
Is the client the end user in HDFS?
No. The client is an application that runs on your machine and is used to interact with the Namenode (job tracker) or datanode (task tracker).
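A deliberately simplified sketch of that decision: keep only the datanodes with room for the block and prefer the ones with the most free space. This is an assumption for illustration, since the actual HDFS placement policy also considers rack placement and other factors. The function name pick_datanodes is hypothetical, and replicas=3 reflects the usual HDFS default replication factor.

```python
def pick_datanodes(free_bytes_by_node, block_size_bytes, replicas=3):
    """Choose up to `replicas` datanodes that can hold the block.

    Simplifying assumption: free space is the only selection criterion.
    Ties are broken by node name so the result is deterministic.
    """
    candidates = [
        node for node, free in free_bytes_by_node.items()
        if free >= block_size_bytes
    ]
    candidates.sort(key=lambda node: (-free_bytes_by_node[node], node))
    return candidates[:replicas]


if __name__ == "__main__":
    free = {"dn1": 100, "dn2": 10, "dn3": 500, "dn4": 200}
    print(pick_datanodes(free, block_size_bytes=64, replicas=2))
```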
What is a rack?
A rack is a physical collection of datanodes stored at a single location. There can be multiple racks at a single location, and the racks that make up a cluster can be physically located at different places.
What is Hadoop Single Point Of Failure (SPOF)?
If the Namenode fails, the entire Hadoop system goes down. This is called Hadoop Single Point Of Failure.
What is a Secondary Namenode?
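The secondary Namenode periodically merges the edit log of the Namenode into the file system image (a checkpoint) and writes the result to disk, so that the metadata the Namenode holds in RAM can be rebuilt after a restart. It is not a hot standby for the Namenode.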
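One way to picture persisting the Namenode's state is a checkpoint: the logged changes are replayed onto the last saved image to produce a new image. The sketch below is a toy model under that assumption. The image is a plain dict of path to size, the log is a list of tuples, and the operation names and the function name checkpoint are hypothetical.

```python
def checkpoint(fsimage, edit_log):
    """Replay the edit log onto a copy of the image.

    Returns the merged image and an empty log; the inputs are not modified.
    """
    merged = dict(fsimage)
    for operation, path, *args in edit_log:
        if operation == "create":
            merged[path] = args[0]   # args[0] is the file size in bytes
        elif operation == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return merged, []


if __name__ == "__main__":
    image = {"/a": 10}
    edits = [("create", "/b", 20), ("delete", "/a")]
    print(checkpoint(image, edits))
```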
Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:
1. Standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode
What are the features of standalone (local) mode?
In standalone mode there are no daemons, and everything runs in a single JVM. It has no DFS and uses the local file system. Standalone mode is suitable only for running MapReduce programs during development. It is one of the least used environments.
What are the features of Pseudo mode?
Pseudo mode is used both for development and in the QA environment. In pseudo mode all the daemons run on the same machine.
Can we call VMs pseudos?
No, VMs are not pseudos: a VM is something different, whereas pseudo mode is very specific to Hadoop.
What are the features of fully distributed mode?
Fully distributed mode is used in the production environment, where ‘n’ machines form a Hadoop cluster and the Hadoop daemons run across that cluster. There is one host on which the Namenode runs, other hosts on which datanodes run, and machines on which task trackers run. This setup has separate masters and separate slaves.