
Hortonworks HDPCA Certification Dumps


Question No : 1
Workflows expressed in Oozie can contain:
A. Sequences of MapReduce and Pig. These sequences can be combined with other
actions including forks, decision points, and path joins.
B. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce
sequences can be combined with forks and path joins.
C. Sequences of MapReduce and Pig jobs. These are limited to linear sequences of
actions with exception handlers but no forks.
D. Iterative repetition of MapReduce jobs until a desired answer or state is reached.
Answer: A
Explanation: An Oozie workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig
jobs) arranged in a control dependency DAG (Directed Acyclic Graph), specifying a sequence
in which the actions execute. This graph is specified in hPDL (an XML Process Definition Language).
hPDL is a fairly compact language, using a limited amount of flow control and action nodes.
Control nodes define the flow of execution and include the beginning and end of a workflow
(start, end and fail nodes) and mechanisms to control the workflow execution path
(decision, fork and join nodes).
Note: Oozie is a Java web application that runs in a Java servlet container (Tomcat) and
uses a database to store:
Workflow definitions
Currently running workflow instances, including instance states and variables
Reference: Introduction to Oozie

Question No : 2
You are developing a combiner that takes as input Text keys, IntWritable values, and emits
Text keys, IntWritable values. Which interface should your class implement?
A. Combiner <Text, IntWritable, Text, IntWritable>
B. Mapper <Text, IntWritable, Text, IntWritable>
C. Reducer <Text, Text, IntWritable, IntWritable>
D. Reducer <Text, IntWritable, Text, IntWritable>
E. Combiner <Text, Text, IntWritable, IntWritable>
Answer: D
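As an illustration, here is a minimal sketch of such a combiner using the old (org.apache.hadoop.mapred) API, where Reducer is an interface; the class name SumCombiner is hypothetical. Its input and output are both Text keys and IntWritable values, matching answer D:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical combiner: locally sums the IntWritable counts for each Text key.
public class SumCombiner extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();   // aggregate the partial counts
        }
        output.collect(key, new IntWritable(sum));
    }
}

It would be registered on the job with conf.setCombinerClass(SumCombiner.class).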

Question No : 3
Which TWO of the following statements are true regarding Hive? Choose 2 answers
A. Useful for data analysts familiar with SQL who need to do ad-hoc queries
B. Offers real-time queries and row level updates
C. Allows you to define a structure for your unstructured Big Data
D. Is a relational database
Answer: A,C

Question No : 4
You need to create a job that does frequency analysis on input data. You will do this by
writing a Mapper that uses TextInputFormat and splits each value (a line of text from an
input file) into individual characters. For each one of these characters, you will emit the
character as a key and an IntWritable as the value. As this will produce proportionally
more intermediate data than input data, which two resources should you expect to be
bottlenecks?
A. Processor and network I/O
B. Disk I/O and network I/O
C. Processor and RAM
D. Processor and disk I/O
Answer: B


Question No : 5
Which one of the following classes would a Pig command use to store data in a table
defined in HCatalog?
A. org.apache.hcatalog.pig.HCatOutputFormat
B. org.apache.hcatalog.pig.HCatStorer
C. No special class is needed for a Pig script to store data in an HCatalog table
D. Pig scripts cannot use an HCatalog table
Answer: B

Question No : 6
All keys used for intermediate output from mappers must:
A. Implement a splittable compression algorithm.
B. Be a subclass of FileInputFormat.
C. Implement WritableComparable.
D. Override isSplitable.
E. Implement a comparator for speedy sorting.
Answer: C
Explanation: The MapReduce framework operates exclusively on <key, value> pairs, that
is, the framework views the input to the job as a set of <key, value> pairs and produces a
set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework.
Reference: MapReduce Tutorial
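As a hedged sketch, a custom intermediate key might look like the following; the class name YearKey and its single field are illustrative only. write/readFields make it serializable and compareTo lets the framework sort it:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key: serializable via write/readFields, sortable via compareTo.
public class YearKey implements WritableComparable<YearKey> {
    private int year;

    public void write(DataOutput out) throws IOException {
        out.writeInt(year);          // serialize the key for the shuffle
    }

    public void readFields(DataInput in) throws IOException {
        year = in.readInt();         // deserialize the key on the reduce side
    }

    public int compareTo(YearKey other) {
        return Integer.compare(year, other.year);   // defines the sort order
    }
}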

Question No : 7
What types of algorithms are difficult to express in MapReduce v1 (MRv1)?
A. Algorithms that require applying the same mathematical function to large numbers of
individual binary records.
B. Relational operations on large amounts of structured and semi-structured data.
C. Algorithms that require global, shared state.
D. Large-scale graph algorithms that require one-step link traversal.
E. Text analysis algorithms on large collections of unstructured text (e.g., Web crawls).
Answer: C
Explanation: See 3) below.
Limitations of Mapreduce – where not to use Mapreduce
While very powerful and applicable to a wide variety of problems, MapReduce is not the
answer to every problem. Here are some problems I found where MapReduce is not suited,
and some papers that address the limitations of MapReduce.
1. Computation depends on previously computed values
If the computation of a value depends on previously computed values, then MapReduce
cannot be used. One good example is the Fibonacci series, where each value is the sum
of the previous two values, i.e., f(k+2) = f(k+1) + f(k). Also, if the data set is small enough to
be computed on a single machine, then it is better to do it as a single reduce(map(data))
operation rather than going through the entire map reduce process.
2. Full-text indexing or ad hoc searching
The index generated in the Map step is one dimensional, and the Reduce step must not
generate a large amount of data or there will be a serious performance degradation. For
example, CouchDB’s MapReduce may not be a good fit for full-text indexing or ad hoc
searching. This is a problem better suited for a tool such as Lucene.
3. Algorithms depend on shared global state
Solutions to many interesting problems in text processing do not require global
synchronization. As a result, they can be expressed naturally in MapReduce, since map
and reduce tasks run independently and in isolation. However, there are many examples of
algorithms that depend crucially on the existence of shared global state during processing,
making them difficult to implement in MapReduce (since the single opportunity for global
synchronization in MapReduce is the barrier between the map and reduce phases of
processing).
Reference: Limitations of Mapreduce – where not to use Mapreduce

Question No : 8
How are keys and values presented and passed to the reducers during a standard sort and
shuffle phase of MapReduce?
A. Keys are presented to reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to reducer in sorted order; values for a given key are sorted in
ascending order.
C. Keys are presented to a reducer in random order; values for a given key are not sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in
ascending order.
Answer: A
Explanation: Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have
output the same key).
The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they
are merged.
SecondarySort
To achieve a secondary sort on the values returned by the value iterator, the application
should extend the key with the secondary key and define a grouping comparator. The keys
will be sorted using the entire key, but will be grouped using the grouping comparator to
decide which keys and values are sent in the same call to reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key,
(collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via
TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference: org.apache.hadoop.mapreduce, Class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
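Where a secondary sort is needed, a grouping comparator along these lines could be used; this is only a sketch and assumes a hypothetical composite Text key of the form "naturalKey\tsecondaryKey". The whole key drives the sort, while grouping compares only the natural-key prefix:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical grouping comparator: groups reduce() calls by the part of the
// composite key before the tab, so values arrive ordered by the secondary part.
public class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(Text.class, true);   // true = create Text instances to compare
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String left = ((Text) a).toString().split("\t", 2)[0];
        String right = ((Text) b).toString().split("\t", 2)[0];
        return left.compareTo(right);
    }
}

With the old API it would be registered via JobConf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class).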

Question No : 9
Which best describes how TextInputFormat processes input files and line breaks?
A. Input file splits may cross line breaks. A line that crosses file splits is read by the
RecordReader of the split that contains the beginning of the broken line.
B. Input file splits may cross line breaks. A line that crosses file splits is read by the
RecordReaders of both splits containing the broken line.
C. The input file is split exactly at the line breaks, so each RecordReader will read a series
of complete lines.
D. Input file splits may cross line breaks. A line that crosses file splits is ignored.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the
RecordReader of the split that contains the end of the broken line.
Answer: A
Reference: How Map and Reduce operations are actually carried out

Question No : 10
To process input key-value pairs, your mapper needs to load a 512 MB data file into memory.
What is the best way to accomplish this?
A. Serialize the data file, insert it into the JobConf object, and read the data into memory in
the configure method of the mapper.
B. Place the data file in the DistributedCache and read the data into memory in the map
method of the mapper.
C. Place the data file in the DataCache and read the data into memory in the configure
method of the mapper.
D. Place the data file in the DistributedCache and read the data into memory in the
configure method of the mapper.
Answer: D
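Explanation: Hadoop's old (mapred) API has no DataCache class; the facility is the DistributedCache, and loading reference data belongs in the mapper's configure() method so the file is read once per task rather than once per record. A hedged sketch, with illustrative class names:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper that loads a cached reference file into memory once per task.
public class LookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final Set<String> lookup = new HashSet<String>();

    @Override
    public void configure(JobConf conf) {
        try {
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            if (cached != null && cached.length > 0) {
                BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
                String line;
                while ((line = reader.readLine()) != null) {
                    lookup.add(line);   // keep the reference data in memory
                }
                reader.close();
            }
        } catch (IOException e) {
            throw new RuntimeException("Could not load cached file", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        if (lookup.contains(value.toString())) {   // use the in-memory data per record
            output.collect(value, new Text("matched"));
        }
    }
}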

Question No : 11
Consider the following two relations, A and B.
Which Pig statement combines A by its first field and B by its second field?
A. C = JOIN B BY a1, A by b2;
B. C = JOIN A by a1, B by b2;
C. C = JOIN A a1, B b2;
D. C = JOIN A $0, B $1;
Answer: B

Question No : 12
What is the disadvantage of using multiple reducers with the default HashPartitioner and
distributing your workload across your cluster?
A. You will not be able to compress the intermediate data.
B. You will no longer be able to take advantage of a Combiner.
C. By using multiple reducers with the default HashPartitioner, output files may not be in
globally sorted order.
D. There are no concerns with this approach. It is always advisable to use multiple
reducers.
Answer: C

Explanation: Multiple reducers and total ordering
If your sort job runs with multiple reducers (either because mapreduce.job.reduces in
mapred-site.xml has been set to a number larger than 1, or because you’ve used the -r
option to specify the number of reducers on the command-line), then by default Hadoop will
use the HashPartitioner to distribute records across the reducers. Use of the
HashPartitioner means that you can’t concatenate your output files to create a single sorted
output file. To do this you’ll need total ordering.
Reference: Sorting text files with MapReduce

Question No : 13
Which MapReduce v2 (MRv2/YARN) daemon is responsible for launching application
containers and monitoring application resource usage?
A. ResourceManager
B. NodeManager
C. ApplicationMaster
D. ApplicationMasterService
E. TaskTracker
F. JobTracker
Answer: B
Reference: Apache Hadoop YARN – Concepts & Applications

Question No : 14
You have the following key-value pairs as output from your Map task:
(the, 1)
(fox, 1)
(faster, 1)
(than, 1)
(the, 1)
(dog, 1)

How many keys will be passed to the Reducer’s reduce method?
A. Six
B. Five
C. Four
D. Two
E. One
F. Three
Answer: B
Explanation: The two (the, 1) pairs are grouped under a single key, so five distinct keys (the, fox, faster, than, dog) are passed to the reduce method.

Question No : 15
What data does a Reducer reduce method process?
A. All the data in a single input file.
B. All data produced by a single mapper.
C. All data for a given key, regardless of which mapper(s) produced it.
D. All data for a given value, regardless of which mapper(s) produced it.
Answer: C
Explanation: Reducing lets you aggregate values together. A reducer function receives an
iterator of input values from an input list. It then combines these values together, returning
a single output value.
All values with the same key are presented to a single reduce task.
Reference: Yahoo! Hadoop Tutorial, Module 4: MapReduce

Question No : 16
You want to count the number of occurrences for each unique word in the supplied input
data. You’ve decided to implement this by having your mapper tokenize each word and
emit a literal value 1, and then have your reducer increment a counter for each literal 1 it
receives. After successfully implementing this, it occurs to you that you could optimize this by
specifying a combiner. Will you be able to reuse your existing Reducer as your combiner in
this case and why or why not?
A. Yes, because the sum operation is both associative and commutative and the input and
output types to the reduce method match.
B. No, because the sum operation in the reducer is incompatible with the operation of a
Combiner.
C. No, because the Reducer and Combiner are separate interfaces.
D. No, because the Combiner is incompatible with a mapper which doesn’t use the same
data type for both the key and value.
E. Yes, because Java is a polymorphic object-oriented language and thus reducer code
can be reused as a combiner.
Answer: A
Explanation: Combiners are used to increase the efficiency of a MapReduce program.
They are used to aggregate intermediate map output locally on individual mapper outputs.
Combiners can help you reduce the amount of data that needs to be transferred across to
the reducers. You can use your reducer code as a combiner if the operation performed is
commutative and associative. The execution of the combiner is not guaranteed; Hadoop may
or may not execute a combiner, and if required it may execute it more than once.
Therefore your MapReduce jobs should not depend on the combiner's execution.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What
are combiners? When should I use a combiner in my MapReduce Job?
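A hedged driver sketch using the old API; WordCountMapper and WordCountReducer are hypothetical classes. Reusing the reducer as the combiner is just a matter of registering the same class twice:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMapper.class);      // hypothetical tokenizing mapper
        conf.setCombinerClass(WordCountReducer.class);   // reuse the reducer: sum is associative and commutative
        conf.setReducerClass(WordCountReducer.class);    // hypothetical summing reducer

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}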

Question No : 17
MapReduce v2 (MRv2/YARN) splits which major functions of the JobTracker into separate
daemons? Select two.
A. Health status checks (heartbeats)
B. Resource management
C. Job scheduling/monitoring
D. Job coordination between the ResourceManager and NodeManager
E. Launching tasks
F. Managing file system metadata
G. MapReduce metric reporting
H. Managing tasks
Answer: B,C
Explanation: The fundamental idea of MRv2 is to split up the two major functionalities of
the JobTracker, resource management and job scheduling/monitoring, into separate
daemons. The idea is to have a global ResourceManager (RM) and per-application
ApplicationMaster (AM). An application is either a single job in the classical sense of Map-
Reduce jobs or a DAG of jobs.
Note:
The central goal of YARN is to clearly separate two things that are unfortunately smushed
together in current Hadoop, specifically in (mainly) JobTracker:
/ Monitoring the status of the cluster with respect to which nodes have which resources
available. Under YARN, this will be global.
/ Managing the parallelization execution of any specific job. Under YARN, this will be done
separately for each job.
Reference: Apache Hadoop YARN – Concepts & Applications

Question No : 18
You have a directory named jobdata in HDFS that contains four files: _first.txt, second.txt,
.third.txt and #data.txt. How many files will be processed by the
FileInputFormat.setInputPaths () command when it's given a path object representing this
directory?
A. Four, all files will be processed
B. Three, the pound sign is an invalid character for HDFS file names
C. Two, file names with a leading period or underscore are ignored
D. None, the directory cannot be named jobdata
E. One, no special characters can prefix the name of an input file
Answer: C
Explanation: FileInputFormat's default filter skips files whose names begin with '_' or '.',
treating them as hidden (like Unix dot files), so _first.txt and .third.txt are ignored. The '#'
character is allowed in HDFS file names, so second.txt and #data.txt are processed.
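FileInputFormat's default path filter behaves roughly like this sketch (the class name here is illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Sketch of the default hidden-file behavior: names starting with '_' or '.' are skipped.
public class HiddenFileFilter implements PathFilter {
    public boolean accept(Path path) {
        String name = path.getName();
        return !name.startsWith("_") && !name.startsWith(".");
    }
}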

Question No : 19
Assuming default settings, which best describes the order of data provided to a reducer’s
reduce method:
A. The keys given to a reducer aren’t in a predictable order, but the values associated with
those keys always are.
B. Both the keys and values passed to a reducer always appear in sorted order.
C. Neither keys nor values are in any predictable order.
D. The keys given to a reducer are in sorted order but the values associated with each key
are in no predictable order
Answer: D
Explanation: Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have
output the same key).
The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they
are merged.
SecondarySort
To achieve a secondary sort on the values returned by the value iterator, the application
should extend the key with the secondary key and define a grouping comparator. The keys
will be sorted using the entire key, but will be grouped using the grouping comparator to
decide which keys and values are sent in the same call to reduce.
3. Reduce

In this phase the reduce(Object, Iterable, Context) method is called for each <key,
(collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via
TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference: org.apache.hadoop.mapreduce, Class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

Question No : 20
Given the following Pig commands:
Which one of the following statements is true?
A. The $1 variable represents the first column of data in 'my.log'
B. The $1 variable represents the second column of data in 'my.log'
C. The severe relation is not valid
D. The grouped relation is not valid
Answer: B

Question No : 21
You have written a Mapper which invokes the following five calls to the
OutputCollector.collect method:
output.collect(new Text("Apple"), new Text("Red"));
output.collect(new Text("Banana"), new Text("Yellow"));
output.collect(new Text("Apple"), new Text("Yellow"));
output.collect(new Text("Cherry"), new Text("Red"));
output.collect(new Text("Apple"), new Text("Green"));
How many times will the Reducer’s reduce method be invoked?
A. 6
B. 3
C. 1
D. 0
E. 5
Answer: B
Explanation: reduce() gets called once for each [key, (list of values)] pair. To explain, let's
say you called:
out.collect(new Text("Car"), new Text("Subaru"));
out.collect(new Text("Car"), new Text("Honda"));
out.collect(new Text("Car"), new Text("Ford"));
out.collect(new Text("Truck"), new Text("Dodge"));
out.collect(new Text("Truck"), new Text("Chevy"));
Then reduce() would be called twice with the pairs
reduce(Car, <Subaru, Honda, Ford>)
reduce(Truck, <Dodge, Chevy>)
Reference: Mapper output.collect()?

Question No : 22
Assuming the following Hive query executes successfully:
Which one of the following statements describes the result set?
A. A bigram of the top 80 sentences that contain the substring "you are" in the lines column
of the inputdata table.
B. An 80-value ngram of sentences that contain the words "you" or "are" in the lines
column of the inputdata table.
C. A trigram of the top 80 sentences that contain "you are" followed by a null space in the
lines column of the inputdata table.
D. A frequency distribution of the top 80 words that follow the subsequence "you are" in the
lines column of the inputdata table.
Answer: D

Question No : 23
For each input key-value pair, mappers can emit:
A. As many intermediate key-value pairs as desired. There are no restrictions on the
types of those key-value pairs (i.e., they can be heterogeneous).
B. As many intermediate key-value pairs as desired, but they cannot be of the same type
as the input key-value pair.
C. One intermediate key-value pair, of a different type.
D. One intermediate key-value pair, but of the same type.
E. As many intermediate key-value pairs as desired, as long as all the keys have the
same types and all the values have the same type.
Answer: E
Explanation: Mapper maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input records into intermediate records. The
transformed intermediate records do not need to be of the same type as the input records.
A given input pair may map to zero or many output pairs.
Reference: Hadoop Map-Reduce Tutorial
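For illustration, a hedged sketch of a mapper (old API; the class name is illustrative) that emits zero or many intermediate pairs per input record, where the intermediate types differ from the input types but are consistent across every emitted pair:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: input is LongWritable/Text, intermediate output is Text/IntWritable.
public class TokenizingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            output.collect(new Text(tokens.nextToken()), ONE);   // zero or many pairs per input line
        }
    }
}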

Question No : 24
In Hadoop 2.2, which one of the following statements is true about a standby NameNode?
The Standby NameNode:
A. Communicates directly with the active NameNode to maintain the state of the active
NameNode.
B. Receives the same block reports as the active NameNode.
C. Runs on the same machine and shares the memory of the active NameNode.
D. Processes all client requests and block reports from the appropriate DataNodes.
Answer: B

Question No : 25
Which HDFS command copies an HDFS file named foo to the local filesystem as localFoo?
A. hadoop fs -get foo LocalFoo
B. hadoop -cp foo LocalFoo
C. hadoop fs -ls foo
D. hadoop fs -put foo LocalFoo
Answer: A
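The same copy can also be done programmatically through the HDFS Java API; a minimal sketch, assuming the default configuration points at your cluster and using the hypothetical class name GetFoo:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Equivalent of `hadoop fs -get foo localFoo` using the FileSystem API.
public class GetFoo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        fs.copyToLocalFile(new Path("foo"), new Path("localFoo"));
    }
}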

Question No : 26
Given the following Hive command:
Which one of the following statements is true?
A. The files in the mydata folder are copied to a subfolder of /apps/hive/warehouse
B. The files in the mydata folder are moved to a subfolder of /apps/hive/warehouse
C. The files in the mydata folder are copied into Hive's underlying relational database
D. The files in the mydata folder do not move from their current location in HDFS
Answer: D

Question No : 27
You want to run Hadoop jobs on your development workstation for testing before you
submit them to your production cluster. Which mode of operation in Hadoop allows you to
most closely simulate a production cluster while using a single machine?
A. Run all the nodes in your production cluster as virtual machines on your development
workstation.
B. Run the hadoop command with the -jt local and the -fs file:/// options.
C. Run the DataNode, TaskTracker, NameNode and JobTracker daemons on a single
machine.
D. Run simldooop, the Apache open-source software for simulating Hadoop clusters.
Answer: C

Question No : 28
To use a Java user-defined function (UDF) with Pig, what must you do?
A. Define an alias to shorten the function name
B. Pass arguments to the constructor of the UDF's implementation class
C. Register the JAR file containing the UDF
D. Put the JAR file into the user's home folder in HDFS
Answer: C

Question No : 29
Which two of the following statements are true about Pig's approach toward data? Choose
2 answers
A. Accepts only data that has a key/value pair structure
B. Accepts data whether it has metadata or not
C. Accepts only data that is defined by metadata tables stored in a database
D. Accepts tab-delimited text data only
E. Accepts any data: structured or unstructured
Answer: B,E

Question No : 30
Your cluster’s HDFS block size is 64MB. You have a directory containing 100 plain text files,
each of which is 100MB in size. The InputFormat for your job is TextInputFormat.
How many Mappers will run?
A. 64
B. 100
C. 200
D. 640
Answer: C
Explanation: Each file would be split into two as the block size (64 MB) is less than the file
size (100 MB), so 200 mappers would be running.
Note:
If you're not compressing the files then hadoop will process your large files (say 10G), with
a number of mappers related to the block size of the file.
Say your block size is 64M, then you will have ~160 mappers processing this 10G file
(160*64 ~= 10G). Depending on how CPU intensive your mapper logic is, this might be an
acceptable block size, but if you find that your mappers are executing in sub-minute times,
then you might want to increase the work done by each mapper (by increasing the block
size to 128, 256, 512m - the actual size depends on how you intend to process the data).
Reference: http://stackoverflow.com/questions/11014493/hadoop-mapreduce-appropriate-input-files-size
(first answer, second paragraph)

Question No : 31
You need to perform statistical analysis in your MapReduce job and would like to call
methods in the Apache Commons Math library, which is distributed as a 1.3 megabyte
Java archive (JAR) file. Which is the best way to make this library available to your
MapReducer job at runtime?
A. Have your system administrator copy the JAR to all nodes in the cluster and set its
location in the HADOOP_CLASSPATH environment variable before you submit your job.
B. Have your system administrator place the JAR file on a Web server accessible to all
cluster nodes and then set the HTTP_JAR_URL environment variable to its location.
C. When submitting the job on the command line, specify the –libjars option followed by the
JAR file path.
D. Package your code and the Apache Commons Math library into a zip file named
JobJar.zip
Answer: C
Explanation: The usage of the jar command is like this,
Usage: hadoop jar <jar> [mainClass] args...
If you want the commons-math3.jar to be available to all the tasks, you can do either of
these:
1. Copy the jar file in $HADOOP_HOME/lib dir
or
2. Use the generic option -libjars.

Question No : 32
Which Hadoop component is responsible for managing the distributed file system
metadata?
A. NameNode
B. Metanode
C. DataNode
D. NameSpaceManager
Answer: A

Question No : 33
Which one of the following statements describes the relationship between the
NodeManager and the ApplicationMaster?
A. The ApplicationMaster starts the NodeManager in a Container
B. The NodeManager requests resources from the ApplicationMaster
C. The ApplicationMaster starts the NodeManager outside of a Container
D. The NodeManager creates an instance of the ApplicationMaster
Answer: D

Question No : 34
Given a directory of files with the following structure: line number, tab character, string:
Example:
1abialkjfjkaoasdfjksdlkjhqweroij
2kadfjhuwqounahagtnbvaswslmnbfgy
3kjfteiomndscxeqalkzhtopedkfsikj
You want to send each line as one record to your Mapper. Which InputFormat should you
use to complete the line: conf.setInputFormat (____.class) ; ?
A. SequenceFileAsTextInputFormat
B. SequenceFileInputFormat
C. KeyValueFileInputFormat
D. BDBInputFormat
Answer: C
Explanation:
http://stackoverflow.com/questions/9721754/how-to-parse-customwritable-from-text-in-hadoop
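Note that the input format that actually ships with Hadoop's old API is org.apache.hadoop.mapred.KeyValueTextInputFormat, which splits each line at the first tab into a key (the line number here) and a value (the rest of the string). A minimal sketch of the configuration line from the question:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class LineRecordJobSetup {
    public static void configureInput(JobConf conf) {
        // Each line becomes one record: key = text before the first tab, value = the remainder.
        conf.setInputFormat(KeyValueTextInputFormat.class);
    }
}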
