Question No : 1
Workflows expressed in Oozie can contain:
A. Sequences of
MapReduce and Pig. These sequences can be combined with other
actions including forks, decision points, and path joins.
B. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce
sequences can be combined with forks and path joins.
C. Sequences of
MapReduce and Pig jobs. These are limited to linear sequences of
actions with exception handlers but no forks.
D. Iterative repetition
of MapReduce jobs until a desired answer or state is reached.
Answer: A
Explanation: An Oozie workflow is a collection of actions (i.e. Hadoop
Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG
(Directed Acyclic Graph), specifying a sequence of action execution.
This graph is specified in hPDL (an XML Process Definition Language).
hPDL is a fairly compact language, using a limited amount of flow
control and action nodes. Control nodes define the flow of execution
and include the beginning and end of a workflow (start, end and fail
nodes) and mechanisms to control the workflow execution path
(decision, fork and join nodes).
Note: Oozie is a Java web application that runs in a Java servlet
container (Tomcat) and uses a database to store:
Workflow definitions
Currently running workflow instances, including instance states and
variables
Reference: Introduction to Oozie
Question No : 2
You are developing a combiner that takes as input Text
keys, IntWritable values, and emits
Text keys, IntWritable values. Which interface should
your class implement?
A. Combiner <Text,
IntWritable, Text, IntWritable>
B. Mapper <Text,
IntWritable, Text, IntWritable>
C. Reducer <Text,
Text, IntWritable, IntWritable>
D. Reducer <Text,
IntWritable, Text, IntWritable>
E. Combiner <Text,
Text, IntWritable, IntWritable>
Answer: D
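For illustration, a combiner with this signature could look like the following minimal sketch, written against the older org.apache.hadoop.mapred API; the class name and the summing logic are assumptions for the example:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical combiner: sums the IntWritable values for each Text key.
// It implements Reducer<Text, IntWritable, Text, IntWritable>, i.e. its
// input types match its output types, which is what lets it run as a combiner.
public class SumCombiner extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}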
Question No : 3
Which TWO of the following statements are true regarding
Hive? Choose 2 answers
A. Useful for data
analysts familiar with SQL who need to do ad-hoc queries
B. Offers real-time
queries and row level updates
C. Allows you to define
a structure for your unstructured Big Data
D. Is a relational
database
Answer: A,C
Question No : 4
You need to create a job that does frequency analysis on
input data. You will do this by
writing a Mapper that uses TextInputFormat and splits
each value (a line of text from an
input file) into individual characters. For each one of
these characters, you will emit the
character as a key and an IntWritable as the value. As
this will produce proportionally
more intermediate data than input data, which two
resources should you expect to be
bottlenecks?
A. Processor and network
I/O
B. Disk I/O and network
I/O
C. Processor and RAM
D. Processor and disk
I/O
Answer: B
Question No : 5
Which one of the following classes would a Pig command
use to store data in a table
defined in HCatalog?
A. org.apache.hcatalog.pig.HCatOutputFormat
B. org.apache.hcatalog.pig.HCatStorer
C. No special class is
needed for a Pig script to store data in an HCatalog table
D. Pig scripts cannot
use an HCatalog table
Answer: B
Question No : 6
All keys used for intermediate output from mappers must:
A. Implement a
splittable compression algorithm.
B. Be a subclass of
FileInputFormat.
C. Implement
WritableComparable.
D. Override isSplitable.
E. Implement a
comparator for speedy sorting.
Answer: C
Explanation: The
MapReduce framework operates exclusively on <key, value> pairs, that
is, the framework views the input to the job as a set of
<key, value> pairs and produces a
set of <key, value> pairs as the output of the job,
conceivably of different types.
The key and value classes have to be serializable by the
framework and hence need to
implement the Writable interface. Additionally, the key
classes have to implement the
WritableComparable interface to facilitate sorting by the
framework.
Reference: MapReduce Tutorial
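As an illustration, a custom intermediate key might look like the sketch below (the class and its field are hypothetical); it must serialize itself and define an ordering so the framework can sort it:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: write/readFields handle serialization,
// compareTo defines the sort order used during the shuffle.
public class WordKey implements WritableComparable<WordKey> {
    private String word = "";

    public void set(String word) { this.word = word; }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
    }

    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
    }

    public int compareTo(WordKey other) {
        return word.compareTo(other.word);
    }

    // hashCode should also be overridden so the default HashPartitioner
    // distributes keys consistently across reducers.
    public int hashCode() {
        return word.hashCode();
    }
}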
Question No : 7
What types of algorithms are difficult to express in
MapReduce v1 (MRv1)?
A. Algorithms that
require applying the same mathematical function to large numbers of
individual binary records.
B. Relational operations
on large amounts of structured and semi-structured data.
C. Algorithms that require a global, shared state.
D. Large-scale graph
algorithms that require one-step link traversal.
E. Text analysis
algorithms on large collections of unstructured text (e.g., Web crawls).
Answer: C
Explanation: See 3)
below.
Limitations of Mapreduce – where not to use Mapreduce
While very powerful and applicable to a wide variety of
problems, MapReduce is not the
answer to every problem. Here are some problems I found
where MapReduce is not suited and some papers that address the limitations of MapReduce.
1. Computation depends on previously computed values
If the computation of a value depends on previously
computed values, then MapReduce
cannot be used. One good example is the Fibonacci series
where each value is summation
of the previous two values. i.e., f(k+2) = f(k+1) + f(k).
Also, if the data set is small enough to
be computed on a single machine, then it is better to do
it as a single reduce(map(data))
operation rather than going through the entire map reduce
process.
2. Full-text indexing or ad hoc searching
The index generated in the Map step is one dimensional,
and the Reduce step must not
generate a large amount of data or there will be a
serious performance degradation. For
example, CouchDB’s MapReduce may not be a good fit for
full-text indexing or ad hoc
searching. This is a problem better suited for a tool
such as Lucene.
3. Algorithms depend on shared global state
Solutions to many interesting problems in text processing
do not require global
synchronization. As a result, they can be expressed
naturally in MapReduce, since map
and reduce tasks run independently and in isolation.
However, there are many examples of
algorithms that depend crucially on the existence of
shared global state during processing,
making them difficult to implement in MapReduce (since
the single opportunity for global
synchronization in MapReduce is the barrier between the
map and reduce phases of
processing)
Reference: Limitations of Mapreduce – where not to use
Mapreduce
Question No : 8
How are keys and values presented and passed to the
reducers during a standard sort and
shuffle phase of MapReduce?
A. Keys are presented to
reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to
reducer in sorted order; values for a given key are sorted in
ascending order.
C. Keys are presented to
a reducer in random order; values for a given key are not sorted.
D. Keys are presented to
a reducer in random order; values for a given key are sorted in
ascending order.
Answer: A
Explanation: Reducer
has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper
using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since
different Mappers may have
output the same key).
The shuffle and sort phases occur simultaneously i.e. while
outputs are being fetched they
are merged.
SecondarySort
To achieve a secondary sort on the values returned by the
value iterator, the application
should extend the key with the secondary key and define a
grouping comparator. The keys
will be sorted using the entire key, but will be grouped
using the grouping comparator to
decide which keys and values are sent in the same call to
reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context)
method is called for each <key,
(collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a
RecordWriter via
TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference: org.apache.hadoop.mapreduce, Class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
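To make this concrete, here is a minimal sketch of a reducer in the org.apache.hadoop.mapreduce API (the class is illustrative): reduce() is called once per key, keys arrive in sorted order, and the values for a key arrive in no guaranteed order unless a secondary sort is configured as described above.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Called once per key with an Iterable over all values for that key;
// the values are grouped but not sorted by default.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}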
Question No : 9
Which best describes how TextInputFormat processes input
files and line breaks?
A. Input file splits may
cross line breaks. A line that crosses file splits is read by the
RecordReader of the split that contains the beginning of
the broken line.
B. Input file splits may
cross line breaks. A line that crosses file splits is read by the
RecordReaders of both splits containing the broken line.
C. The input file is
split exactly at the line breaks, so each RecordReader will read a series
of complete lines.
D. Input file splits may
cross line breaks. A line that crosses file splits is ignored.
E. Input file splits may
cross line breaks. A line that crosses file splits is read by the
RecordReader of the split that contains the end of the
broken line.
Answer: A
Reference: How Map and Reduce operations are actually
carried out
Question No : 10
To process input key-value pairs, your mapper needs to load a 512 MB data file into memory.
What is the best way to accomplish this?
A. Serialize the data
file, insert in it the JobConf object, and read the data into memory in
the configure method of the mapper.
B. Place the data file
in the DistributedCache and read the data into memory in the map
method of the mapper.
C. Place the data file
in the DataCache and read the data into memory in the configure
method of the mapper.
D. Place the data file
in the DistributedCache and read the data into memory in the
configure method of the mapper.
Answer: D
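A sketch of the DistributedCache approach with the older mapred API, where configure() runs once per task before any calls to map(); the class name and file path are illustrative. In the driver, the file would first be registered with DistributedCache.addCacheFile(new URI("/data/lookup.dat"), jobConf).
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void configure(JobConf job) {
        try {
            // The cached file is available on the task's local disk;
            // read it once here, before any map() calls.
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            // ... load cached[0] into an in-memory structure ...
        } catch (IOException e) {
            throw new RuntimeException("Failed to load cached data file", e);
        }
    }

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // ... use the in-memory data while processing each record ...
    }
}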
Question No : 11
Consider the following two relations, A and B.
Which Pig statement combines A by its first field and B
by its second field?
A. C = JOIN B BY a1, A by b2;
B. C = JOIN A by a1, B by b2;
C. C = JOIN A a1, B b2;
D. C = JOIN A $0, B $1;
Answer: B
Question No : 12
What is the disadvantage of using multiple reducers with
the default HashPartitioner and
distributing your workload across your cluster?
A. You will not be able
to compress the intermediate data.
B. You will no longer be able to take advantage of a Combiner.
C. By using multiple
reducers with the default HashPartitioner, output files may not be in
globally sorted order.
D. There are no concerns
with this approach. It is always advisable to use multiple
reducers.
Answer: C
Explanation: Multiple
reducers and total ordering
If your sort job runs with multiple reducers (either
because mapreduce.job.reduces in
mapred-site.xml has been set to a number larger than 1,
or because you’ve used the -r
option to specify the number of reducers on the
command-line), then by default Hadoop will
use the HashPartitioner to distribute records across the
reducers. Use of the
HashPartitioner means that you can’t concatenate your
output files to create a single sorted
output file. To do this you'll need total ordering.
Reference: Sorting text files with MapReduce
Question No : 13
Identify the MapReduce v2 (MRv2 / YARN) daemon
responsible for launching application
containers and monitoring application resource usage?
A. ResourceManager
B. NodeManager
C. ApplicationMaster
D. ApplicationMasterService
E. TaskTracker
F. JobTracker
Answer: B
Reference: Apache Hadoop YARN – Concepts &
Applications
Question No : 14
You have the following key-value pairs as output from
your Map task:
(the, 1)
(fox, 1)
(faster, 1)
(than, 1)
(the, 1)
(dog, 1)
How many keys will be passed to the Reducer’s reduce
method?
A. Six
B. Five
C. Four
D. Two
E. One
F. Three
Answer: B
Explanation: The two (the, 1) pairs share the same key, so they are
grouped together into a single key before being passed to the reducer.
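Grouped by key, the intermediate data becomes (the, [1, 1]), (fox, [1]), (faster, [1]), (than, [1]) and (dog, [1]) — five distinct keys, so the reduce method is invoked five times.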
Question No : 15
What data does a Reducer reduce method process?
A. All the data in a
single input file.
B. All data produced by
a single mapper.
C. All data for a given
key, regardless of which mapper(s) produced it.
D. All data for a given
value, regardless of which mapper(s) produced it.
Answer: C
Explanation: Reducing
lets you aggregate values together. A reducer function receives an
iterator of input values from an input list. It then
combines these values together, returning
a single output value.
All values with the same key are presented to a single
reduce task.
Reference: Yahoo! Hadoop Tutorial, Module 4: MapReduce
Question No : 16
You want to count the number of occurrences for each
unique word in the supplied input
data. You’ve decided to implement this by having your
mapper tokenize each word and
emit a literal value 1, and then have your reducer
increment a counter for each literal 1 it
receives. After successfully implementing this, it occurs to you that
you could optimize this by specifying a combiner. Will you be able to
reuse your existing Reducer as your combiner in this case, and why or
why not?
A. Yes, because the sum
operation is both associative and commutative and the input and
output types to the reduce method match.
B. No, because the sum
operation in the reducer is incompatible with the operation of a
Combiner.
C. No, because the
Reducer and Combiner are separate interfaces.
D. No, because the
Combiner is incompatible with a mapper which doesn’t use the same
data type for both the key and value.
E. Yes, because Java is
a polymorphic object-oriented language and thus reducer code
can be reused as a combiner.
Answer: A
Explanation: Combiners
are used to increase the efficiency of a MapReduce program.
They are used to aggregate intermediate map output
locally on individual mapper outputs.
Combiners can help you reduce the amount of data that
needs to be transferred across to
the reducers. You can use your reducer code as a combiner
if the operation performed is
commutative and associative. The execution of the combiner is not
guaranteed: Hadoop may or may not execute it, and it may execute it
more than once. Therefore your MapReduce jobs should not depend on
the combiner's execution.
Reference: 24 Interview Questions & Answers for
Hadoop MapReduce developers, What
are combiners? When should I use a combiner in my
MapReduce Job?
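For example, in the driver the same class can be registered for both roles; this is a minimal sketch in which TokenizerMapper and SumReducer are hypothetical class names:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as the combiner: summing is associative and
        // commutative, and the reduce method's input and output types match.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}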
Question No : 17
MapReduce v2 (MRv2/YARN) splits which major functions of
the JobTracker into separate
daemons? Select two.
A. Health state checks (heartbeats)
B. Resource management
C. Job
scheduling/monitoring
D. Job coordination
between the ResourceManager and NodeManager
E. Launching tasks
F. Managing file system
metadata
G. MapReduce metric
reporting
H. Managing tasks
Answer: B,C
Explanation: The
fundamental idea of MRv2 is to split up the two major functionalities of
the JobTracker, resource management and job
scheduling/monitoring, into separate
daemons. The idea is to have a global ResourceManager
(RM) and per-application
ApplicationMaster (AM). An application is either a single
job in the classical sense of Map-
Reduce jobs or a DAG of jobs.
Note:
The central goal of YARN is to clearly separate two
things that are unfortunately smushed
together in current Hadoop, specifically in (mainly)
JobTracker:
/ Monitoring the status of the cluster with respect to
which nodes have which resources
available. Under YARN, this will be global.
/ Managing the parallel execution of any specific job. Under YARN, this will be done
separately for each job.
Reference: Apache Hadoop YARN – Concepts &
Applications
Question No : 18
You have a directory named jobdata in HDFS that contains
four files: _first.txt, second.txt,
.third.txt and #data.txt. How many files will be
processed by the
FileInputFormat.setInputPaths () command when it's given
a path object representing this
directory?
A. Four, all files will
be processed
B. Three, the pound sign
is an invalid character for HDFS file names
C. Two, file names with
a leading period or underscore are ignored
D. None, the directory
cannot be named jobdata
E. One, no special
characters can prefix the name of an input file
Answer: C
Explanation: Files whose names begin with '_' or '.' are treated as
'hidden' by FileInputFormat, much like Unix files starting with '.',
and are skipped. The '#' character is allowed in HDFS file names, so
#data.txt is processed.
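For reference, a sketch of how the path would be set (the job object and directory path are illustrative); FileInputFormat's default hidden-file filter is what excludes the names beginning with '_' or '.':
// Of the four files in jobdata, only second.txt and #data.txt
// become input splits; _first.txt and .third.txt are filtered out.
FileInputFormat.setInputPaths(job, new Path("/user/hadoop/jobdata"));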
Question No : 19
Assuming default settings, which best describes the order
of data provided to a reducer’s
reduce method:
A. The keys given to a
reducer aren’t in a predictable order, but the values associated with
those keys always are.
B. Both the keys and
values passed to a reducer always appear in sorted order.
C. Neither keys nor
values are in any predictable order.
D. The keys given to a
reducer are in sorted order but the values associated with each key
are in no predictable order
Answer: D
Explanation: Reducer
has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper
using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since
different Mappers may have
output the same key).
The shuffle and sort phases occur simultaneously i.e.
while outputs are being fetched they
are merged.
SecondarySort
To achieve a secondary sort on the values returned by the
value iterator, the application
should extend the key with the secondary key and define a
grouping comparator. The keys
will be sorted using the entire key, but will be grouped
using the grouping comparator to
decide which keys and values are sent in the same call to
reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context)
method is called for each <key,
(collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a
RecordWriter via
TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference: org.apache.hadoop.mapreduce, Class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Question No : 20
Given the following Pig commands:
Which one of the following statements is true?
A. The $1 variable
represents the first column of data in 'my.log'
B. The $1 variable
represents the second column of data in 'my.log'
C. The severe relation
is not valid
D. The grouped relation
is not valid
Answer: B
Question No : 21
You have written a Mapper which invokes the following
five calls to the OutputCollector.collect method:
output.collect(new Text("Apple"), new Text("Red"));
output.collect(new Text("Banana"), new Text("Yellow"));
output.collect(new Text("Apple"), new Text("Yellow"));
output.collect(new Text("Cherry"), new Text("Red"));
output.collect(new Text("Apple"), new Text("Green"));
How many times will the Reducer’s reduce method be
invoked?
A. 6
B. 3
C. 1
D. 0
E. 5
Answer: B
Explanation: reduce()
gets called once for each [key, (list of values)] pair. To explain, let's
say you called:
out.collect(new Text("Car"), new Text("Subaru"));
out.collect(new Text("Car"), new Text("Honda"));
out.collect(new Text("Car"), new Text("Ford"));
out.collect(new Text("Truck"), new Text("Dodge"));
out.collect(new Text("Truck"), new Text("Chevy"));
Then reduce() would be called twice with the pairs
reduce(Car, <Subaru, Honda, Ford>)
reduce(Truck, <Dodge, Chevy>)
Reference: Mapper output.collect()?
Question No : 22
Assuming the following Hive query executes successfully:
Which one of the following statements describes the
result set?
A. A bigram of the top 80 sentences that contain the substring "you are" in the lines column of the inputdata table.
B. An 80-value ngram of sentences that contain the words "you" or "are" in the lines column of the inputdata table.
C. A trigram of the top 80 sentences that contain "you are" followed by a null space in the lines column of the inputdata table.
D. A frequency distribution of the top 80 words that follow the subsequence "you are" in the lines column of the inputdata table.
Answer: D
Question No : 23
For each input key-value pair, mappers can emit:
A. As many intermediate
key-value pairs as designed. There are no restrictions on the
types of those key-value pairs (i.e., they can be
heterogeneous).
B. As many intermediate
key-value pairs as designed, but they cannot be of the same type
as the input key-value pair.
C. One intermediate
key-value pair, of a different type.
D. One intermediate
key-value pair, but of the same type.
E. As many intermediate
key-value pairs as designed, as long as all the keys have the
same types and all the values have the same type.
Answer: E
Explanation: Mapper
maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input
records into intermediate records. The
transformed intermediate records do not need to be of the
same type as the input records.
A given input pair may map to zero or many output pairs.
Reference: Hadoop Map-Reduce Tutorial
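For example, a word-count style mapper emits zero or more (Text, IntWritable) pairs per input line; a minimal sketch (the class name is illustrative):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One input record can produce zero, one, or many intermediate pairs,
// but every emitted key is a Text and every emitted value an IntWritable.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}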
Question No : 24
In Hadoop 2.2, which one of the following statements is
true about a standby NameNode?
The Standby NameNode:
A. Communicates directly
with the active NameNode to maintain the state of the active
NameNode.
B. Receives the same
block reports as the active NameNode.
C. Runs on the same
machine and shares the memory of the active NameNode.
D. Processes all client
requests and block reports from the appropriate DataNodes.
Answer: B
Question No : 25
Which HDFS command copies an HDFS file named foo to the
local filesystem as localFoo?
A. hadoop fs -get foo
LocalFoo
B. hadoop -cp foo
LocalFoo
C. hadoop fs -ls foo
D. hadoop fs -put foo
LocalFoo
Answer: A
Question No : 26
Given the following Hive command:
Which one of the following statements is true?
A. The files in the mydata folder are copied to a subfolder of /apps/hive/warehouse
B. The files in the mydata folder are moved to a subfolder of /apps/hive/warehouse
C. The files in the mydata folder are copied into Hive's underlying relational database
D. The files in the mydata folder do not move from their current location in HDFS
Answer: D
Question No : 27
You want to run Hadoop jobs on your development workstation
for testing before you
submit them to your production cluster. Which mode of
operation in Hadoop allows you to
most closely simulate a production cluster while using a
single machine?
A. Run all the nodes in
your production cluster as virtual machines on your development
workstation.
B. Run the hadoop command with the -jt local and -fs file:/// options.
C. Run the DataNode,
TaskTracker, NameNode and JobTracker daemons on a single
machine.
D. Run simldooop, the
Apache open-source software for simulating Hadoop clusters.
Answer: C
Question No : 28
To use a Java user-defined function (UDF) with Pig, what must you do?
A. Define an alias to
shorten the function name
B. Pass arguments to the constructor of the UDF's implementation class
C. Register the JAR file
containing the UDF
D. Put the JAR file into
the user's home folder in HDFS
Answer: C
Question No : 29
Which two of the following statements are true about
Pig's approach toward data? Choose
2 answers
A. Accepts only data
that has a key/value pair structure
B. Accepts data whether
it has metadata or not
C. Accepts only data
that is defined by metadata tables stored in a database
D. Accepts tab-delimited
text data only
E. Accepts any data:
structured or unstructured
Answer: B,E
Question No : 30
Your cluster's HDFS block size is 64 MB. You have a directory
containing 100 plain text files,
each of which is 100MB in size. The InputFormat for your
job is TextInputFormat.
Determine how many Mappers will run?
A. 64
B. 100
C. 200
D. 640
Answer: C
Explanation: Each file
would be split into two as the block size (64 MB) is less than the file
size (100 MB), so 200 mappers would be running.
Note:
If you're not compressing the files then hadoop will
process your large files (say 10G), with
a number of mappers related to the block size of the
file.
Say your block size is 64M, then you will have ~160
mappers processing this 10G file
(160*64 ~= 10G). Depending on how CPU intensive your
mapper logic is, this might be an
acceptable blocks size, but if you find that your mappers
are executing in sub minute times,
then you might want to increase the work done by each
mapper (by increasing the block
size to 128, 256, 512m - the actual size depends on how
you intend to process the data).
Reference: http://stackoverflow.com/questions/11014493/hadoop-mapreduce-appropriateinput-files-size
(first answer, second paragraph)
Question No : 31
You need to perform statistical analysis in your
MapReduce job and would like to call
methods in the Apache Commons Math library, which is
distributed as a 1.3 megabyte
Java archive (JAR) file. Which is the best way to make
this library available to your
MapReduce job at runtime?
A. Have your system
administrator copy the JAR to all nodes in the cluster and set its
location in the HADOOP_CLASSPATH environment variable
before you submit your job.
B. Have your system administrator
place the JAR file on a Web server accessible to all
cluster nodes and then set the HTTP_JAR_URL environment
variable to its location.
C. When submitting the
job on the command line, specify the –libjars option followed by the
JAR file path.
D. Package your code and
the Apache Commons Math library into a zip file named
JobJar.zip
Answer: C
Explanation: The usage
of the jar command is like this,
Usage: hadoop jar <jar> [mainClass] args...
If you want the commons-math3.jar to be available for all
the tasks you can do any one of
these
1. Copy the jar file in $HADOOP_HOME/lib dir
or
2. Use the generic option -libjars.
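For example, a submission might look like the following (the jar name, driver class and paths are illustrative); note that the generic -libjars option is picked up only when the driver parses generic options, e.g. via ToolRunner:
hadoop jar myjob.jar com.example.AnalysisDriver -libjars /path/to/commons-math3.jar /input /output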
Question No : 32
Which Hadoop component is responsible for managing the
distributed file system
metadata?
A. NameNode
B. Metanode
C. DataNode
D. NameSpaceManager
Answer: A
Question No : 33
Which one of the following statements describes the
relationship between the
NodeManager and the ApplicationMaster?
A. The ApplicationMaster
starts the NodeManager in a Container
B. The NodeManager
requests resources from the ApplicationMaster
C. The ApplicationMaster
starts the NodeManager outside of a Container
D. The NodeManager
creates an instance of the ApplicationMaster
Answer: D
Question No : 34
Given a directory of files with the following structure:
line number, tab character, string:
Example:
1	abialkjfjkaoasdfjksdlkjhqweroij
2	kadfjhuwqounahagtnbvaswslmnbfgy
3	kjfteiomndscxeqalkzhtopedkfsikj
You want to send each line as one record to your Mapper.
Which InputFormat should you
use to complete the line: conf.setInputFormat
(____.class) ; ?
A. SequenceFileAsTextInputFormat
B. SequenceFileInputFormat
C. KeyValueTextInputFormat
D. BDBInputFormat
Answer: C
Explanation:
http://stackoverflow.com/questions/9721754/how-to-parse-customwritable-from-text-inhadoop
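For reference, a sketch of the relevant configuration in the older JobConf API implied by conf.setInputFormat (the input path is illustrative):
// KeyValueTextInputFormat splits each line at the first tab character,
// so the line number becomes the key and the rest of the line the value;
// each line therefore reaches the Mapper as one record.
conf.setInputFormat(KeyValueTextInputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("/user/hadoop/lines"));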