Hadoop MapReduce Interview Questions & Answers

MapReduce is a programming model suited to processing large amounts of data. Hadoop can run MapReduce programs written in various languages, such as Python, Java, Ruby and C++. MapReduce programs are parallel in nature, which makes them very useful for large-scale data analysis across multiple machines in a cluster. The questions and answers below give an idea of the kind and level of questions you can expect in a Hadoop MapReduce interview.


    1. Question 1. What Is Hadoop MapReduce?

      Answer :

      MapReduce is a programming model and framework used to process or analyze vast amounts of data on a Hadoop cluster. It processes large datasets in parallel across the nodes of the cluster in a fault-tolerant manner.

    2. Question 2. Can You Elaborate On A MapReduce Job?

      Answer :

      Based on its configuration, a MapReduce job first divides the input data into independent chunks called input splits (by default, one split per HDFS block). The map function processes the splits in parallel, and its output is then processed by the reduce function. The framework takes care of sorting the map outputs and scheduling the tasks.
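
      The flow described above can be sketched in plain Python. This is a minimal single-process simulation, not Hadoop code: the names map_fn, reduce_fn and run_job are hypothetical, and the word count task is assumed only for illustration.

```python
from collections import defaultdict

def map_fn(_offset, line):
    """Map: emit (word, 1) for every word in one input record."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: sum all counts that share the same key."""
    yield word, sum(counts)

def run_job(lines):
    # Map phase: each record is processed independently.
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(map_fn(offset, line))
    # Shuffle and sort: group the values by key; keys are visited in sorted order.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one call per key with the full list of values.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

result = run_job(["Hadoop runs MapReduce", "MapReduce runs on Hadoop"])
```

      In a real cluster the map calls run on many nodes at once and the grouping step is the distributed shuffle; here everything runs in one process.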

    3. Question 3. Why Are The Compute Nodes And The Storage Nodes The Same?

      Answer :

      Compute nodes process the data; storage nodes store it. The Hadoop framework tries to minimize network traffic, and to achieve that it follows the data locality concept: the compute code executes where the data is stored. That is why the data node and the compute node are the same.

    4. Question 4. What Is The Importance Of The Configuration Object In MapReduce?

      Answer :

      The Configuration object is used to set and get parameters as name/value pairs, which are stored in XML files. It is used to initialize values, to read values from external files and to pass them to the job as parameters. Parameters start with Hadoop's default values, and these are overridden by the values that come from external configuration files.

    5. Question 5. Where Is MapReduce Not Recommended?

      Answer :

      MapReduce is not recommended for iterative processing, where the output of one pass is fed back as input in a loop. Such workloads run as a series of MapReduce jobs, and each job persists its data to disk before the next job loads it again. That is a costly operation, so MapReduce is not a good fit.

    6. Question 6. What Is The NameNode And What Are Its Responsibilities?

      Answer :

      NameNode is the name of the daemon that runs on the master node of HDFS. It is the heart of the entire Hadoop system: it stores the file system metadata in the FsImage and receives block information from the DataNodes along with their heartbeats.

    7. Question 7. What Is The JobTracker’s Responsibility?

      Answer :

      The JobTracker schedules a job's tasks on the slave nodes, which execute the tasks as directed. It also monitors the tasks and re-executes any that fail.

    8. Question 8. What Are The JobTracker & TaskTracker In MapReduce?

      Answer :

      The MapReduce framework consists of a single JobTracker per cluster and one TaskTracker per node. A cluster usually has multiple nodes, so each cluster has one JobTracker and multiple TaskTrackers. The JobTracker schedules the jobs and monitors the TaskTrackers. If a TaskTracker fails to execute a task, the JobTracker re-executes the failed task.

      A TaskTracker follows the JobTracker's instructions and executes the tasks. As a slave node, it reports the status of its tasks to the master JobTracker in the form of heartbeats.

    9. Question 9. What Is The Importance Of Job Scheduling In Hadoop MapReduce?

      Answer :

      Scheduling is a systematic way of allocating resources among multiple tasks in the best possible way. A Hadoop cluster runs many jobs at once, and sometimes a particular job must finish quickly and be given higher priority; job schedulers make that possible. The default scheduler is FIFO. The FIFO, Fair and Capacity schedulers are the most popular schedulers in Hadoop.

    10. Question 10. When Is A Reducer Used?

      Answer :

      A reducer is used to combine the output of multiple mappers. It has three primary phases: shuffle, sort and reduce. Data can be processed without a reducer, but one is needed whenever the shuffle and sort are required.

    11. Question 11. What Is Replication Factor?

      Answer :

      The replication factor is the number of copies of each chunk of data that are stored on different nodes within a cluster. The default value is 3, but it can be changed. Each file is automatically split into blocks, which are spread across the cluster.
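
      As a rough illustration of what the replication factor costs in storage, here is a small arithmetic sketch. The function name hdfs_footprint is hypothetical, and the 64 MB block size is assumed as the default.

```python
import math

def hdfs_footprint(file_mb, block_mb=64, replication=3):
    """Return (blocks, block replicas, raw MB used) for one file."""
    blocks = math.ceil(file_mb / block_mb)   # the last block may be partly filled
    replicas = blocks * replication          # every block is stored this many times
    raw_mb = file_mb * replication           # a partial block only uses its real size
    return blocks, replicas, raw_mb

footprint = hdfs_footprint(200)
```

      With the default replication factor of 3, a 200 MB file is stored as 4 blocks, 12 block replicas and 600 MB of raw disk space.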

    12. Question 12. Where Does The Shuffle And Sort Process Happen?

      Answer :

      After a mapper generates its output, the intermediate data is stored temporarily on the local file system. The location of this temporary directory is usually configured in Hadoop's core-site.xml. The framework aggregates and sorts the intermediate data and then passes it on to be processed by the reduce function. The framework deletes the temporary data from the local file system once the job completes.

    13. Question 13. Is Java Mandatory To Write MapReduce Jobs?

      Answer :

      No. Hadoop itself is implemented in Java, but MapReduce applications need not be written in Java. Hadoop supports Python, Ruby, C++ and other programming languages. The Hadoop Streaming API lets you create and run MapReduce jobs with any executable or script as the mapper and/or the reducer. Hadoop Pipes lets programmers implement MapReduce applications in C++.
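
      A streaming-style mapper and reducer could look like the sketch below. It assumes tab-separated key/value lines as the record format between the two scripts; the function names are hypothetical, and sorted() stands in for the framework's sort between map and reduce.

```python
from itertools import groupby

def mapper(lines):
    """Streaming-style mapper: read text lines, emit 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Streaming-style reducer: input arrives sorted by key, one pair per line."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Stand-in for the framework's sort between the two scripts.
mapped = sorted(mapper(["b a", "a c a"]))
streamed = list(reducer(mapped))
```

      In an actual streaming job, each function would be a separate script that reads sys.stdin and prints to standard output.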

    14. Question 14. What Methods Control The Map And Reduce Functions’ Output Types?

      Answer :

      setOutputKeyClass() and setOutputValueClass()

      By default these apply to both the map and the reduce output. If the map output types differ from the final output types, they can be set with:

      setMapOutputKeyClass() and setMapOutputValueClass()

    15. Question 15. What Is The Main Difference Between Mapper And Reducer?

      Answer :

      • The map method is called separately for each input key/value pair. It processes input key/value pairs and emits intermediate key/value pairs.
      • The reduce method is called separately for each key and its list of values. It processes intermediate key/value pairs and emits the final key/value pairs.
      • Both classes have an initialization method that is called before any other method. It receives no key/value input and produces no output.

    16. Question 16. Why Are Compute Nodes And Storage Nodes The Same?

      Answer :

      Compute nodes are logical processing units; storage nodes are physical storage units. Both run on the same node because of data locality. As a result Hadoop minimizes network traffic and processes the data quickly.

    17. Question 17. What Is The Difference Between A Map-Side Join And A Reduce-Side Join? Or When Do We Use A Map-Side Join And When A Reduce-Side Join?

      Answer :

      Joining multiple tables on the mapper side is called a map-side join. A map-side join has strict requirements: the inputs must be sorted and partitioned in the same way, or one of the datasets must be small enough to be held in memory by every mapper. When those requirements are met, the join avoids the reducer phase altogether.

      Joining multiple tables on the reducer side is called a reduce-side join. It places no such requirements on the input, so it suits large tables, including the case where one table has a large number of rows and columns and the other only a few. It is the most general way to join multiple tables, at the cost of shuffling all the data to the reducers.
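
      The two join styles can be contrasted with a small in-memory sketch. The users and orders tables are made-up sample data and both functions are hypothetical; the map-side variant assumes the small table fits in memory.

```python
from collections import defaultdict

users = {1: "alice", 2: "bob"}                     # small table
orders = [(1, "book"), (2, "pen"), (1, "lamp")]    # large table

def map_side_join(orders, users):
    """Map-side join: the small table is held in memory by every mapper."""
    return [(users[uid], item) for uid, item in orders if uid in users]

def reduce_side_join(orders, users):
    """Reduce-side join: tag each record, group by join key, join per key."""
    groups = defaultdict(lambda: {"user": [], "order": []})
    for uid, name in users.items():
        groups[uid]["user"].append(name)
    for uid, item in orders:
        groups[uid]["order"].append(item)
    joined = []
    for uid in sorted(groups):                     # keys arrive sorted at the reducer
        for name in groups[uid]["user"]:
            for item in groups[uid]["order"]:
                joined.append((name, item))
    return joined

map_joined = map_side_join(orders, users)
reduce_joined = reduce_side_join(orders, users)
```

      Both produce the same joined rows; the difference is that the reduce-side version has to move every record through the grouping step first.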

    18. Question 18. What Happens If The Number Of Reducers Is 0?

      Answer :

      Setting the number of reducers to 0 is a valid configuration in MapReduce. In this case no reducer runs, so the mapper output is treated as the final output, and Hadoop stores it in the job's output directory.

    19. Question 19. When Do We Use A Combiner? Why Is It Recommended?

      Answer :

      Mappers and reducers are independent; they do not talk to each other. When the reduce function is commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c), we can use a combiner to optimize the MapReduce process. Many MapReduce jobs are limited by network bandwidth, so the framework tries to minimize the data transferred between nodes. To achieve this, MapReduce lets the user define a combiner function that runs on the map output. It is an optional MapReduce optimization technique.

    20. Question 20. What Is The Main Difference Between A MapReduce Combiner And A Reducer?

      Answer :

      Both the combiner and the reducer are optional, but both are frequently used in MapReduce. There are three main differences:

      • A combiner receives its input from only one mapper, while a reducer receives input from multiple mappers.
      • Use a reducer whenever aggregation is required; add a combiner only if the function is commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c).
      • A combiner's input and output key and value types must be the same, whereas a reducer's output types may differ from its input types.

    21. Question 21. What Is A Combiner?

      Answer :

      A combiner performs a local aggregation of the key/value pairs produced by a mapper. It greatly reduces the amount of duplicated data transferred between nodes and so improves job performance. The framework decides whether the combiner runs zero, one or multiple times. It is not suitable for functions such as the mean, because a mean of partial means is generally not the overall mean.
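
      The sketch below illustrates why a sum can safely be used as a combiner while a mean cannot. The helper combine_then_reduce is hypothetical: it applies the same function once per mapper and then once more on the partial results.

```python
def combine_then_reduce(mapper_outputs, fn):
    """Apply fn per mapper (combiner), then again on the partial results (reducer)."""
    partials = [fn(values) for values in mapper_outputs]
    return fn(partials)

def mean(values):
    return sum(values) / len(values)

mapper_outputs = [[1, 2, 3], [10]]      # values for one key from two mappers
all_values = [v for values in mapper_outputs for v in values]

sum_with_combiner = combine_then_reduce(mapper_outputs, sum)    # 6 + 10
mean_with_combiner = combine_then_reduce(mapper_outputs, mean)  # mean of 2.0 and 10.0
true_mean = mean(all_values)                                    # 16 / 4
```

      The combined sum matches the plain sum, but the combined mean (6.0) differs from the true mean (4.0). A common workaround is to have the combiner emit (sum, count) pairs and divide only in the reducer.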

    22. Question 22. What Is Partition?

      Answer :

      The Partitioner works on the intermediate map output, after the combiner (if any), and controls which reducer each key goes to in the shuffle and sort. It divides the intermediate data according to the number of reducers, so that all the data in a single partition is processed by a single reducer; each partition is handled by exactly one reducer. Whenever a job uses reducers, the framework invokes the Partitioner automatically.

    23. Question 23. When Do We Use Partitions?

      Answer :

      By default Hive reads the entire dataset even when the application needs only a slice of it, which is a bottleneck for MapReduce jobs. Hive therefore offers a special option called partitions: when you create a table, Hive partitions it according to your requirements.

    24. Question 24. What Are The Important Steps When You Are Partitioning A Table?

      Answer :

      Don't over-partition the data into partitions that are too small; that is overhead for the NameNode.

      With dynamic partitioning in strict mode, at least one static partition must exist. To allow every partition to be dynamic, enable dynamic partitioning and switch to nonstrict mode with the following commands.

      SET hive.exec.dynamic.partition = true;
      SET hive.exec.dynamic.partition.mode = nonstrict;

      First load the data into a non-partitioned table, then load it from there into the partitioned table. Data is not loaded directly from a local file into a dynamically partitioned table.
      insert overwrite table table_name partition(year) select * from non_partitioned_table;

    25. Question 25. Can You Elaborate On The MapReduce Job Architecture?

      Answer :

      First, the Hadoop programmer submits the MapReduce program to the JobClient.

      The JobClient asks the JobTracker for a job ID. The JobTracker provides a unique ID of the form Job_HadoopStartedtime_00001.

      Once the JobClient has received the job ID, it copies the job resources (job.xml, job.jar) to the file system (HDFS) and submits the job to the JobTracker. The JobTracker initializes the job and schedules it.

      Based on the configuration, the input is divided into input splits, which are also submitted to HDFS. Each TaskTracker retrieves the job resources from HDFS and launches a child JVM. The map and reduce tasks run in this child JVM, and the TaskTracker reports the job status to the JobTracker.

    26. Question 26. Why Does The TaskTracker Launch A Child JVM?

      Answer :

      Hadoop developers frequently submit jobs that are wrong or contain bugs by mistake. If the TaskTracker ran tasks in its existing JVM, a faulty task could disrupt the main JVM and affect other tasks. With a child JVM, if a task tries to damage existing resources, the TaskTracker kills that child JVM and retries the task in a new child JVM.

    27. Question 27. Why Do The JobClient And JobTracker Submit Job Resources To The File System?

      Answer :

      Because of data locality: moving computation is cheaper than moving data. The logic (computation) is in the JAR file and the splits, while the data sits on the DataNodes of the file system. So the job resources are copied to where the data is available.

    28. Question 28. How Many Mappers And Reducers Can Run?

      Answer :

      By default Hadoop can run 2 mappers and 2 reducers on one DataNode, because each node has 2 map slots and 2 reduce slots. These defaults can be changed in mapred-site.xml in the conf directory.

    29. Question 29. What Is InputSplit?

      Answer :

      An InputSplit is the logical chunk of data processed by a single mapper. By default, the InputSplit size equals the block size.

    30. Question 30. How To Configure The Split Value?

      Answer :

      By default the block size is 64 MB, but to process the data it is divided into splits. Either of these formulas gives the split size; they agree as long as min_split_size does not exceed max_split_size.

      1. split size = min(max_split_size, max(block_size, min_split_size))
      2. split size = max(min_split_size, min(block_size, max_split_size))

      By default, split size = block size.
      The number of mappers always equals the number of splits.

      Applying the formulas with block_size = 64 MB, min_split_size = 512 KB and max_split_size = 10 GB (the maximum depends on the environment and may be 1 GB or 10 GB):

      • Formula 1: split size = min(10 GB, max(64 MB, 512 KB)) = min(10 GB, 64 MB) = 64 MB.
      • Formula 2: split size = max(512 KB, min(64 MB, 10 GB)) = max(512 KB, 64 MB) = 64 MB.
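
      The second formula can be checked with a short sketch. The function name compute_split_size is hypothetical, and the 10 GB maximum and 512 KB minimum are the assumed values from the worked example.

```python
MB = 1024 * 1024

def compute_split_size(block_size, min_split_size, max_split_size):
    """split size = max(min_split_size, min(max_split_size, block_size))."""
    return max(min_split_size, min(max_split_size, block_size))

# Defaults from the worked example: the split size equals the block size.
default_split = compute_split_size(64 * MB, 512 * 1024, 10 * 1024 * MB)
# Raising the minimum above the block size makes the splits larger.
larger_split = compute_split_size(64 * MB, 128 * MB, 10 * 1024 * MB)
# Lowering the maximum below the block size makes the splits smaller.
smaller_split = compute_split_size(64 * MB, 512 * 1024, 32 * MB)
```

      The last two calls show how the minimum and maximum split sizes are the knobs that move the split size away from the block size.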

    31. Question 31. How Much RAM Is Required To Process 64 MB Of Data?

      Answer :

      Let us assume a 64 MB block size and a system that runs 2 mappers and 2 reducers: 64 × 4 = 256 MB of memory. The OS takes at least 30% extra, about 77 MB, so at least 256 + 77 = 333 MB of RAM is required to process a chunk of data. Processing unstructured data requires more memory still.

    32. Question 32. What Is The Difference Between A Block And A Split?

      Answer :

      • Block: the physical chunk in which data is stored in HDFS.
      • Split: the logical chunk of data that is processed by a single mapper.

    33. Question 33. Why Does The Hadoop Framework Read A File In Parallel And Not Sequentially?

      Answer :

      Hadoop reads data in parallel so that it can retrieve the data faster. Writes, however, happen in sequence rather than in parallel, because with parallel writes the data of one node could be overwritten by another. Parallel processes are independent of one another, so if the data were written in parallel, no writer would know where the next chunk of data belongs. For example, a 100 MB file is stored as one 64 MB block and one 36 MB block; if the two blocks were written in parallel, the writer of the first block would not know where the remaining data is. So Hadoop reads in parallel and writes sequentially.

    34. Question 34. If I Change The Block Size From 64 To 128, What Happens?

      Answer :

      Changing the block size does not affect existing data. After the change, every new file is divided into 128 MB blocks. So old data stays in 64 MB blocks, while new data is stored in 128 MB blocks.
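
      To make the effect concrete, this small sketch lists the block sizes a file is stored as under each setting. The function block_layout is hypothetical and works in whole megabytes.

```python
def block_layout(file_mb, block_mb):
    """Return the sizes (in MB) of the blocks a file of file_mb is stored as."""
    full, rest = divmod(file_mb, block_mb)
    return [block_mb] * full + ([rest] if rest else [])

old_file = block_layout(100, 64)    # written before the change
new_file = block_layout(100, 128)   # written after the change
```

      A 100 MB file written before the change occupies a 64 MB block and a 36 MB block; the same file written afterwards fits in a single block.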

    35. Question 35. What Is isSplitable()?

      Answer :

      isSplitable() determines whether the input format may split a file; by default it returns true. For unstructured data, splitting is not recommended, so the entire file should be processed as one split. To do that, change isSplitable() to return false.

    36. Question 36. What Are The Maximum And Minimum Block Sizes Hadoop Allows?

      Answer :

      • Minimum: 512 bytes, the block size of the local OS file system. The block size cannot be set below that.
      • Maximum: depends on the environment. There is no upper bound.

    37. Question 37. What Are The Job Resource Files?

      Answer :

      job.xml and job.jar are the core resources needed to process a job. The JobClient copies these resources to HDFS.

    38. Question 38. What Does A MapReduce Job Consist Of?

      Answer :

      A MapReduce job is a unit of work that the client wants performed. It consists of the input data, the MapReduce program in a JAR file and the configuration settings in XML files. Hadoop runs the job by dividing it into tasks with the help of the JobTracker.

    39. Question 39. What Is Data Locality?

      Answer :

      Data locality means processing the data on the node where it is stored, instead of moving the data to the computation. The guiding principle is “Moving Computation is Cheaper than Moving Data”. It is possible when the data is splittable, which it is by default.

    40. Question 40. What Is Speculative Execution?

      Answer :

      Hadoop runs on commodity hardware, so individual systems may fail or be short of memory. If a system fails, the task running on it fails too, which is undesirable. Speculative execution is a performance optimization technique: the same computation is distributed to multiple systems, and the result from whichever system finishes first is used. It is enabled by default. Even if one system crashes, that is not a problem, because the framework takes the result from another system.

      For example, suppose the same logic is distributed to systems A, B, C and D.

      Running simultaneously, systems A, B, C and D would take 10, 8, 9 and 12 minutes respectively. System B finishes first, so its result is used, and the framework takes care of killing the processes on the remaining systems.
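
      The selection in this example can be sketched as follows. The function speculative_winner is hypothetical and only models the outcome, not how Hadoop schedules the duplicate attempts.

```python
def speculative_winner(attempt_minutes):
    """Return the winning node, its run time, and the nodes whose attempts are killed."""
    winner = min(attempt_minutes, key=attempt_minutes.get)
    killed = sorted(node for node in attempt_minutes if node != winner)
    return winner, attempt_minutes[winner], killed

winner, minutes, killed = speculative_winner({"A": 10, "B": 8, "C": 9, "D": 12})
```

      With the timings above, B wins after 8 minutes and the attempts on A, C and D are killed.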

    41. Question 41. When Do We Use A Reducer?

      Answer :

      A reducer is needed only when shuffle and sort are required; otherwise neither the reducer nor partitioning is needed. A filter, for example, needs no shuffle and sort, so it can be done without a reducer.

    42. Question 42. What Is ChainMapper?

      Answer :

      ChainMapper is a special mapper class that runs a set of mapper classes in a chain within a single map task. The output of one mapper becomes the input of the next, so that any number of mappers can be connected in a chain.
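
      The chaining idea can be sketched with plain generator functions. The three mappers and run_chain are hypothetical; the sketch only shows how each mapper's output records feed the next mapper.

```python
def tokenize(key, value):
    """First mapper: split a line into (word, 1) records."""
    for word in value.split():
        yield word, 1

def lowercase(key, value):
    """Second mapper: normalize the key."""
    yield key.lower(), value

def drop_short(key, value):
    """Third mapper: filter out words of two letters or fewer."""
    if len(key) > 2:
        yield key, value

def run_chain(mappers, records):
    """Feed each mapper's output records into the next mapper in the chain."""
    for mapper in mappers:
        records = [out for key, value in records for out in mapper(key, value)]
    return records

chained = run_chain([tokenize, lowercase, drop_short], [(0, "Hadoop is FAST")])
```

      All three steps run inside what would be a single map task, so no intermediate job output is written between them.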

    43. Question 43. How To Do Value Level Comparison?

      Answer :

      Hadoop compares and sorts at the key level only, not at the value level. To compare values, they have to be made part of the key.
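
      One common workaround, usually called secondary sort, is to build a composite key that contains the value. The sketch below assumes made-up (city, temperature) records and uses Python's sorted() and groupby() in place of the framework's sort and grouping.

```python
from itertools import groupby

records = [("nyc", 31), ("sf", 18), ("nyc", 25), ("sf", 22), ("nyc", 28)]

# Move the value into a composite key so the key-only sort also orders values.
composite = sorted((city, temp) for city, temp in records)

# Group on the natural key (city) only, as a grouping comparator would.
grouped = {
    city: [temp for _, temp in group]
    for city, group in groupby(composite, key=lambda pair: pair[0])
}
```

      Each city's temperatures now arrive in ascending order, even though the sort only ever looked at keys.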

    44. Question 44. What Are The Setup And Cleanup Methods?

      Answer :

      If you do not know where the processing of a split starts and ends, per-task work becomes difficult to manage; the setup and cleanup methods solve this. By default one mapper is called for each split, and each split gets exactly one setup call and one cleanup call, however many records it contains. Setup initializes the job resources.

      The purpose of cleanup is to close the job resources. The map method processes the data, and once the last map call has completed, cleanup is invoked. The same can be done in the reducer. Use these methods when you compare at the record level, for example one key/value against another: the resource is opened once, used many times and closed once. This saves a lot of network overhead during processing and improves data transfer performance.
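
      The open-once, process-many, close-once pattern can be sketched like this. LookupMapper and run_task are hypothetical, and the in-memory dictionary stands in for an external resource such as a lookup file or a connection.

```python
class LookupMapper:
    """Hypothetical mapper that opens a resource once per task, not once per record."""

    def __init__(self):
        self.calls = []

    def setup(self):
        self.calls.append("setup")
        self.lookup = {"us": "United States", "in": "India"}   # opened once

    def map(self, key, value):
        self.calls.append("map")
        return key, self.lookup.get(value, "unknown")

    def cleanup(self):
        self.calls.append("cleanup")
        self.lookup = None                                      # released once

def run_task(mapper, split):
    """Mimic the task lifecycle: setup, then map per record, then cleanup."""
    mapper.setup()
    try:
        return [mapper.map(k, v) for k, v in split]
    finally:
        mapper.cleanup()

task_mapper = LookupMapper()
task_output = run_task(task_mapper, [(0, "us"), (1, "in"), (2, "fr")])
```

      The recorded call order is one setup, three map calls and one cleanup, regardless of how many records the split holds.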

    45. Question 45. How Many Slots Are Allocated On Each Node?

      Answer :

      By default each node has 2 slots for map tasks and 2 slots for reduce tasks, so each node has 4 slots to process the data.

    46. Question 46. Why Does The TaskTracker Launch A Child JVM To Do A Task? Why Not Use The Existing JVM?

      Answer :

      Sometimes child threads corrupt parent threads: because of a programmer's mistake, the entire MapReduce task can be disrupted. So the TaskTracker launches a child JVM for each individual mapper or reducer task. If the TaskTracker used its existing JVM, a faulty task might damage the main JVM. If a bug occurs, the TaskTracker kills the child process and launches another child JVM to do the same task. Usually the TaskTracker relaunches and retries a task up to 4 times.

    47. Question 47. What Are The Main Components Of Mapreduce Job?

      Answer :

      • Main Driver Class: provides the job configuration parameters.
      • Mapper Class: must extend the org.apache.hadoop.mapreduce.Mapper class and implements the map() method.
      • Reducer Class: must extend the org.apache.hadoop.mapreduce.Reducer class and implements the reduce() method.

    48. Question 48. What Main Configuration Parameters Are Specified In Mapreduce?

      Answer :

      MapReduce programmers need to specify the following configuration parameters to run the map and reduce tasks:

      • The input location of the job in HDFS.
      • The output location of the job in HDFS.
      • The input and output formats.
      • The classes containing the map and reduce functions, respectively.
      • The .jar file for the mapper, reducer and driver classes.

    49. Question 49. What Is Partitioner And Its Usage?

      Answer :

      The Partitioner is another important phase. It controls the partitioning of the keys of the intermediate map output, by default using a hash function. Partitioning determines which reducer each key-value pair of the map output is sent to. The number of partitions equals the total number of reduce tasks for the job.

      HashPartitioner is the default Partitioner class in Hadoop. It implements the following function:

      int getPartition(K key, V value, int numReduceTasks)

      The function returns the partition number for the key; numReduceTasks is the fixed number of reducers.
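
      The logic of a hash partitioner can be sketched in a few lines. This is an approximation, not Hadoop's code: zlib.crc32 stands in for Java's hashCode(), so the partition numbers will differ from a real cluster, and get_partition is a hypothetical name.

```python
import zlib

def get_partition(key, num_reduce_tasks):
    """Mask the hash to a non-negative int, then take it modulo the reducer count."""
    key_hash = zlib.crc32(key.encode("utf-8"))   # stand-in for Java's hashCode()
    return (key_hash & 0x7FFFFFFF) % num_reduce_tasks

keys = ["apple", "banana", "cherry", "apple"]
partitions = [get_partition(k, 3) for k in keys]
```

      The property that matters is that equal keys always land in the same partition, so a single reducer sees every value for a given key.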

    50. Question 50. What Is IdentityMapper?

      Answer :

      IdentityMapper is the default Mapper class provided by Hadoop. When no other Mapper class is defined, IdentityMapper is executed. It only writes the input data to the output and does not perform any computation on it. The class name is org.apache.hadoop.mapred.lib.IdentityMapper.

    51. Question 51. What Is RecordReader In MapReduce?

      Answer :

      RecordReader reads key/value pairs from the InputSplit, converting its byte-oriented view of the input into the record-oriented view presented to the Mapper.
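
      As an illustration of that conversion, the sketch below turns raw bytes into (byte offset, line) records, which is the shape of record a line-oriented text reader hands to the mapper. The function line_records is hypothetical and ignores lines that span split boundaries.

```python
def line_records(split_bytes, start_offset=0):
    """Turn a byte-oriented split into (byte offset, line text) records."""
    offset = start_offset
    for raw_line in split_bytes.splitlines(keepends=True):
        # The key is the offset of the line start; the value is the line itself.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)

records = list(line_records(b"first line\nsecond\nthird\n"))
```

      Each record's key is the position of the line within the input and its value is the decoded text without the line terminator.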

    52. Question 52. What Is OutputCommitter?

      Answer :

      OutputCommitter describes the commit of task output for a MapReduce job. FileOutputCommitter is the default OutputCommitter class in MapReduce. It performs the following operations:

      • Creates a temporary output directory for the job during initialization.
      • Cleans up the job, that is, removes the temporary output directory after the job completes.
      • Sets up the task's temporary output.
      • Checks whether a task needs a commit, and applies the commit if required.
      • JobSetup, JobCleanup and TaskCleanup are the important tasks during output commit.

    53. Question 53. What Are The Parameters Of Mappers And Reducers?

      Answer :

      The four type parameters of a mapper are, for example in a word count job:

      • LongWritable (input key)
      • Text (input value)
      • Text (intermediate output key)
      • IntWritable (intermediate output value)

      The four type parameters of a reducer are:

      • Text (intermediate output key)
      • IntWritable (intermediate output value)
      • Text (final output key)
      • IntWritable (final output value)

    54. Question 54. What Is A “reducer” In Hadoop?

      Answer :

      In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.

    55. Question 55. What Is A “map” In Hadoop?

      Answer :

      In Hadoop, map is the first phase of a MapReduce job. A map task reads data from an input location and outputs key/value pairs according to the input type.

    56. Question 56. Explain JobConf In MapReduce?

      Answer :

      JobConf is the primary interface for describing a MapReduce job to Hadoop for execution. It specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat and OutputFormat implementations, as well as other advanced job facets such as Comparators.

    57. Question 57. What Is JobTracker?

      Answer :

      JobTracker is a Hadoop service used for processing MapReduce jobs in the cluster. It submits tasks to specific nodes, preferably those holding the data, and tracks them. Only one JobTracker runs in a Hadoop cluster, in its own JVM process. If the JobTracker goes down, all the jobs halt.

    58. Question 58. Explain Job Scheduling Through JobTracker?

      Answer :

      The JobTracker communicates with the NameNode to identify the data location and submits the work to a TaskTracker node. The TaskTracker plays a major role because it notifies the JobTracker of any job failure. It also sends heartbeats, which reassure the JobTracker that it is still alive. The JobTracker then decides what to do: it may resubmit the job, mark a specific record as unreliable or blacklist the TaskTracker.

    59. Question 59. What Is SequenceFileInputFormat?

      Answer :

      SequenceFileInputFormat is an input format for reading sequence files, a compressed binary file format; it extends FileInputFormat. It is used to pass data between MapReduce jobs, from the output of one job to the input of another.

    60. Question 60. How To Set Mappers And Reducers For Hadoop Jobs?

      Answer :

      Users can set the number of mappers and reducers on the JobConf object:

      • job.setNumMapTasks()
      • job.setNumReduceTasks()


All rights reserved © 2020 Wisdom IT Services India Pvt. Ltd