Question No : 1
Workflows expressed in Oozie can contain:
A. Sequences of
MapReduce and Pig. These sequences can be combined with other
actions including forks, decision points, and path joins.
B. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce
sequences can be combined with forks and path joins.
C. Sequences of
MapReduce and Pig jobs. These are limited to linear sequences of
actions with exception handlers but no forks.
D. Iterative repetition
of MapReduce jobs until a desired answer or state is reached.
Answer: A
Explanation: An Oozie workflow is a collection of actions (i.e. Hadoop
Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG
(Directed Acyclic Graph), specifying a sequence of action execution.
This graph is specified in hPDL (an XML Process Definition Language).
hPDL is a fairly compact language, using a limited amount of flow
control and action nodes. Control nodes define the flow of execution
and include the beginning and end of a workflow (start, end and fail
nodes) and mechanisms to control the workflow execution path
(decision, fork and join nodes).
Note: Oozie is a Java web application that runs in a Java servlet
container (Tomcat) and uses a database to store:
Workflow definitions
Currently running workflow instances, including instance states and
variables
Reference: Introduction to Oozie
Question No : 2
You are developing a combiner that takes as input Text
keys, IntWritable values, and emits
Text keys, IntWritable values. Which interface should
your class implement?
A. Combiner <Text,
IntWritable, Text, IntWritable>
B. Mapper <Text,
IntWritable, Text, IntWritable>
C. Reducer <Text,
Text, IntWritable, IntWritable>
D. Reducer <Text,
IntWritable, Text, IntWritable>
E. Combiner <Text,
Text, IntWritable, IntWritable>
Answer: D
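For illustration, a combiner with this signature could look like the following minimal sketch, written against the older org.apache.hadoop.mapred API; the class name and the summing logic are assumptions for the example:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical combiner: sums the IntWritable values for each Text key.
// It implements Reducer<Text, IntWritable, Text, IntWritable>, i.e. its
// input types match its output types, which is what lets it run as a combiner.
public class SumCombiner extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}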
Question No : 3
Which TWO of the following statements are true regarding
Hive? Choose 2 answers
A. Useful for data
analysts familiar with SQL who need to do ad-hoc queries
B. Offers real-time
queries and row level updates
C. Allows you to define
a structure for your unstructured Big Data
D. Is a relational
database
Answer: A,C
Question No : 4
You need to create a job that does frequency analysis on
input data. You will do this by
writing a Mapper that uses TextInputFormat and splits
each value (a line of text from an
input file) into individual characters. For each one of
these characters, you will emit the
character as a key and an IntWritable as the value. As
this will produce proportionally
more intermediate data than input data, which two
resources should you expect to be
bottlenecks?
A. Processor and network
I/O
B. Disk I/O and network
I/O
C. Processor and RAM
D. Processor and disk
I/O
Answer: B
Question No : 5
Which one of the following classes would a Pig command
use to store data in a table
defined in HCatalog?
A. org.apache.hcatalog.pig.HCatOutputFormat
B. org.apache.hcatalog.pig.HCatStorer
C. No special class is
needed for a Pig script to store data in an HCatalog table
D. Pig scripts cannot
use an HCatalog table
Answer: B
Question No : 6
All keys used for intermediate output from mappers must:
A. Implement a
splittable compression algorithm.
B. Be a subclass of
FileInputFormat.
C. Implement
WritableComparable.
D. Override isSplitable.
E. Implement a
comparator for speedy sorting.
Answer: C
Explanation: The
MapReduce framework operates exclusively on <key, value> pairs, that
is, the framework views the input to the job as a set of
<key, value> pairs and produces a
set of <key, value> pairs as the output of the job,
conceivably of different types.
The key and value classes have to be serializable by the
framework and hence need to
implement the Writable interface. Additionally, the key
classes have to implement the
WritableComparable interface to facilitate sorting by the
framework.
Reference: MapReduce Tutorial
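As an illustration, a custom intermediate key might look like the sketch below (the class and its field are hypothetical); it must serialize itself and define an ordering so the framework can sort it:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: write/readFields handle serialization,
// compareTo defines the sort order used during the shuffle.
public class WordKey implements WritableComparable<WordKey> {
    private String word = "";

    public void set(String word) { this.word = word; }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
    }

    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
    }

    public int compareTo(WordKey other) {
        return word.compareTo(other.word);
    }

    // hashCode should also be overridden so the default HashPartitioner
    // distributes keys consistently across reducers.
    public int hashCode() {
        return word.hashCode();
    }
}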
Question No : 7
What types of algorithms are difficult to express in
MapReduce v1 (MRv1)?
A. Algorithms that
require applying the same mathematical function to large numbers of
individual binary records.
B. Relational operations
on large amounts of structured and semi-structured data.
C. Algorithms that require a global, shared state.
D. Large-scale graph
algorithms that require one-step link traversal.
E. Text analysis
algorithms on large collections of unstructured text (e.g., Web crawls).
Answer: C
Explanation: See 3)
below.
Limitations of Mapreduce – where not to use Mapreduce
While very powerful and applicable to a wide variety of
problems, MapReduce is not the
answer to every problem. Here are some problems I found
where MapReduce is not suited and some papers that address the limitations of MapReduce.
1. Computation depends on previously computed values
If the computation of a value depends on previously
computed values, then MapReduce
cannot be used. One good example is the Fibonacci series
where each value is summation
of the previous two values. i.e., f(k+2) = f(k+1) + f(k).
Also, if the data set is small enough to
be computed on a single machine, then it is better to do
it as a single reduce(map(data))
operation rather than going through the entire map reduce
process.
2. Full-text indexing or ad hoc searching
The index generated in the Map step is one dimensional,
and the Reduce step must not
generate a large amount of data or there will be a
serious performance degradation. For
example, CouchDB’s MapReduce may not be a good fit for
full-text indexing or ad hoc
searching. This is a problem better suited for a tool
such as Lucene.
3. Algorithms depend on shared global state
Solutions to many interesting problems in text processing
do not require global
synchronization. As a result, they can be expressed
naturally in MapReduce, since map
and reduce tasks run independently and in isolation.
However, there are many examples of
algorithms that depend crucially on the existence of
shared global state during processing,
making them difficult to implement in MapReduce (since
the single opportunity for global
synchronization in MapReduce is the barrier between the
map and reduce phases of
processing)
Reference: Limitations of Mapreduce – where not to use
Mapreduce
Question No : 8
How are keys and values presented and passed to the
reducers during a standard sort and
shuffle phase of MapReduce?
A. Keys are presented to
reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to
reducer in sorted order; values for a given key are sorted in
ascending order.
C. Keys are presented to
a reducer in random order; values for a given key are not sorted.
D. Keys are presented to
a reducer in random order; values for a given key are sorted in
ascending order.
Answer: A
Explanation: Reducer
has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper
using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since
different Mappers may have
output the same key).
The shuffle and sort phases occur simultaneously i.e. while
outputs are being fetched they
are merged.
SecondarySort
To achieve a secondary sort on the values returned by the
value iterator, the application
should extend the key with the secondary key and define a
grouping comparator. The keys
will be sorted using the entire key, but will be grouped
using the grouping comparator to
decide which keys and values are sent in the same call to
reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context)
method is called for each <key,
(collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a
RecordWriter via
TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference: org.apache.hadoop.mapreduce, Class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
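To make this concrete, here is a minimal sketch of a reducer in the org.apache.hadoop.mapreduce API (the class is illustrative): reduce() is called once per key, keys arrive in sorted order, and the values for a key arrive in no guaranteed order unless a secondary sort is configured as described above.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Called once per key with an Iterable over all values for that key;
// the values are grouped but not sorted by default.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}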
Question No : 9
Which best describes how TextInputFormat processes input
files and line breaks?
A. Input file splits may
cross line breaks. A line that crosses file splits is read by the
RecordReader of the split that contains the beginning of
the broken line.
B. Input file splits may
cross line breaks. A line that crosses file splits is read by the
RecordReaders of both splits containing the broken line.
C. The input file is
split exactly at the line breaks, so each RecordReader will read a series
of complete lines.
D. Input file splits may
cross line breaks. A line that crosses file splits is ignored.
E. Input file splits may
cross line breaks. A line that crosses file splits is read by the
RecordReader of the split that contains the end of the
broken line.
Answer: A
Reference: How Map and Reduce operations are actually
carried out
Question No : 10
To process input key-value pairs, your mapper needs to load a 512 MB data file into memory.
What is the best way to accomplish this?
A. Serialize the data
file, insert in it the JobConf object, and read the data into memory in
the configure method of the mapper.
B. Place the data file
in the DistributedCache and read the data into memory in the map
method of the mapper.
C. Place the data file
in the DataCache and read the data into memory in the configure
method of the mapper.
D. Place the data file
in the DistributedCache and read the data into memory in the
configure method of the mapper.
Answer: D
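A sketch of the DistributedCache approach with the older mapred API, where configure() runs once per task before any calls to map(); the class name and file path are illustrative. In the driver, the file would first be registered with DistributedCache.addCacheFile(new URI("/data/lookup.dat"), jobConf).
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void configure(JobConf job) {
        try {
            // The cached file is available on the task's local disk;
            // read it once here, before any map() calls.
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            // ... load cached[0] into an in-memory structure ...
        } catch (IOException e) {
            throw new RuntimeException("Failed to load cached data file", e);
        }
    }

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // ... use the in-memory data while processing each record ...
    }
}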
Question No : 11
Consider the following two relations, A and B.
Which Pig statement combines A by its first field and B
by its second field?
A. C = JOIN B BY a1, A by b2;
B. C = JOIN A by a1, B by b2;
C. C = JOIN A a1, B b2;
D. C = JOIN A $0, B $1;
Answer: B
Question No : 12
What is the disadvantage of using multiple reducers with
the default HashPartitioner and
distributing your workload across your cluster?
A. You will not be able
to compress the intermediate data.
B. You will no longer be able to take advantage of a Combiner.
C. By using multiple
reducers with the default HashPartitioner, output files may not be in
globally sorted order.
D. There are no concerns
with this approach. It is always advisable to use multiple
reducers.
Answer: C
Explanation: Multiple
reducers and total ordering
If your sort job runs with multiple reducers (either
because mapreduce.job.reduces in
mapred-site.xml has been set to a number larger than 1,
or because you’ve used the -r
option to specify the number of reducers on the
command-line), then by default Hadoop will
use the HashPartitioner to distribute records across the
reducers. Use of the
HashPartitioner means that you can’t concatenate your
output files to create a single sorted
output file. To do this you'll need total ordering.
Reference: Sorting text files with MapReduce
Question No : 13
Identify the MapReduce v2 (MRv2 / YARN) daemon
responsible for launching application
containers and monitoring application resource usage?
A. ResourceManager
B. NodeManager
C. ApplicationMaster
D. ApplicationMasterService
E. TaskTracker
F. JobTracker
Answer: B
Reference: Apache Hadoop YARN – Concepts &
Applications
Question No : 14
You have the following key-value pairs as output from
your Map task:
(the, 1)
(fox, 1)
(faster, 1)
(than, 1)
(the, 1)
(dog, 1)
How many keys will be passed to the Reducer’s reduce
method?
A. Six
B. Five
C. Four
D. Two
E. One
F. Three
Answer: B
Explanation: The two (the, 1) pairs share the same key, so they are
grouped together into a single key before being passed to the reducer.
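Grouped by key, the intermediate data becomes (the, [1, 1]), (fox, [1]), (faster, [1]), (than, [1]) and (dog, [1]) — five distinct keys, so the reduce method is invoked five times.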
Question No : 15
What data does a Reducer reduce method process?
A. All the data in a
single input file.
B. All data produced by
a single mapper.
C. All data for a given
key, regardless of which mapper(s) produced it.
D. All data for a given
value, regardless of which mapper(s) produced it.
Answer: C
Explanation: Reducing
lets you aggregate values together. A reducer function receives an
iterator of input values from an input list. It then
combines these values together, returning
a single output value.
All values with the same key are presented to a single
reduce task.
Reference: Yahoo! Hadoop Tutorial, Module 4: MapReduce
Question No : 16
You want to count the number of occurrences for each
unique word in the supplied input
data. You’ve decided to implement this by having your
mapper tokenize each word and
emit a literal value 1, and then have your reducer
increment a counter for each literal 1 it
receives. After successfully implementing this, it occurs to you that
you could optimize this by specifying a combiner. Will you be able to
reuse your existing Reducer as your combiner in this case, and why or
why not?
A. Yes, because the sum
operation is both associative and commutative and the input and
output types to the reduce method match.
B. No, because the sum
operation in the reducer is incompatible with the operation of a
Combiner.
C. No, because the
Reducer and Combiner are separate interfaces.
D. No, because the
Combiner is incompatible with a mapper which doesn’t use the same
data type for both the key and value.
E. Yes, because Java is
a polymorphic object-oriented language and thus reducer code
can be reused as a combiner.
Answer: A
Explanation: Combiners
are used to increase the efficiency of a MapReduce program.
They are used to aggregate intermediate map output
locally on individual mapper outputs.
Combiners can help you reduce the amount of data that
needs to be transferred across to
the reducers. You can use your reducer code as a combiner
if the operation performed is
commutative and associative. The execution of the combiner is not
guaranteed: Hadoop may or may not execute it, and it may execute it
more than once. Therefore your MapReduce jobs should not depend on
the combiner's execution.
Reference: 24 Interview Questions & Answers for
Hadoop MapReduce developers, What
are combiners? When should I use a combiner in my
MapReduce Job?
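For example, in the driver the same class can be registered for both roles; this is a minimal sketch in which TokenizerMapper and SumReducer are hypothetical class names:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as the combiner: summing is associative and
        // commutative, and the reduce method's input and output types match.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}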
Question No : 17
MapReduce v2 (MRv2/YARN) splits which major functions of
the JobTracker into separate
daemons? Select two.
A. Health state checks (heartbeats)
B. Resource management
C. Job
scheduling/monitoring
D. Job coordination
between the ResourceManager and NodeManager
E. Launching tasks
F. Managing file system
metadata
G. MapReduce metric
reporting
H. Managing tasks
Answer: B,C
Explanation: The
fundamental idea of MRv2 is to split up the two major functionalities of
the JobTracker, resource management and job
scheduling/monitoring, into separate
daemons. The idea is to have a global ResourceManager
(RM) and per-application
ApplicationMaster (AM). An application is either a single
job in the classical sense of Map-
Reduce jobs or a DAG of jobs.
Note:
The central goal of YARN is to clearly separate two
things that are unfortunately smushed
together in current Hadoop, specifically in (mainly)
JobTracker:
/ Monitoring the status of the cluster with respect to
which nodes have which resources
available. Under YARN, this will be global.
/ Managing the parallel execution of any specific job. Under YARN, this will be done
separately for each job.
Reference: Apache Hadoop YARN – Concepts &
Applications
Question No : 18
You have a directory named jobdata in HDFS that contains
four files: _first.txt, second.txt,
.third.txt and #data.txt. How many files will be
processed by the
FileInputFormat.setInputPaths () command when it's given
a path object representing this
directory?
A. Four, all files will
be processed
B. Three, the pound sign
is an invalid character for HDFS file names
C. Two, file names with
a leading period or underscore are ignored
D. None, the directory
cannot be named jobdata
E. One, no special
characters can prefix the name of an input file
Answer: C
Explanation: Files whose names begin with '_' or '.' are treated as
'hidden' by FileInputFormat, much like Unix files starting with '.',
and are skipped. The '#' character is allowed in HDFS file names, so
#data.txt is processed.
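For reference, a sketch of how the path would be set (the job object and directory path are illustrative); FileInputFormat's default hidden-file filter is what excludes the names beginning with '_' or '.':
// Of the four files in jobdata, only second.txt and #data.txt
// become input splits; _first.txt and .third.txt are filtered out.
FileInputFormat.setInputPaths(job, new Path("/user/hadoop/jobdata"));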
Question No : 19
Assuming default settings, which best describes the order
of data provided to a reducer’s
reduce method:
A. The keys given to a
reducer aren’t in a predictable order, but the values associated with
those keys always are.
B. Both the keys and
values passed to a reducer always appear in sorted order.
C. Neither keys nor
values are in any predictable order.
D. The keys given to a
reducer are in sorted order but the values associated with each key
are in no predictable order
Answer: D
Explanation: Reducer
has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper
using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since
different Mappers may have
output the same key).
The shuffle and sort phases occur simultaneously i.e.
while outputs are being fetched they
are merged.
SecondarySort
To achieve a secondary sort on the values returned by the
value iterator, the application
should extend the key with the secondary key and define a
grouping comparator. The keys
will be sorted using the entire key, but will be grouped
using the grouping comparator to
decide which keys and values are sent in the same call to
reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context)
method is called for each <key,
(collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a
RecordWriter via
TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference: org.apache.hadoop.mapreduce, Class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Question No : 20
Given the following Pig commands:
Which one of the following statements is true?
A. The $1 variable
represents the first column of data in 'my.log'
B. The $1 variable
represents the second column of data in 'my.log'
C. The severe relation
is not valid
D. The grouped relation
is not valid
Answer: B
Question No : 21
You have written a Mapper which invokes the following
five calls to the OutputCollector.collect method:
output.collect(new Text("Apple"), new Text("Red"));
output.collect(new Text("Banana"), new Text("Yellow"));
output.collect(new Text("Apple"), new Text("Yellow"));
output.collect(new Text("Cherry"), new Text("Red"));
output.collect(new Text("Apple"), new Text("Green"));
How many times will the Reducer’s reduce method be
invoked?
A. 6
B. 3
C. 1
D. 0
E. 5
Answer: B
Explanation: reduce()
gets called once for each [key, (list of values)] pair. To explain, let's
say you called:
out.collect(new Text("Car"), new Text("Subaru"));
out.collect(new Text("Car"), new Text("Honda"));
out.collect(new Text("Car"), new Text("Ford"));
out.collect(new Text("Truck"), new Text("Dodge"));
out.collect(new Text("Truck"), new Text("Chevy"));
Then reduce() would be called twice with the pairs
reduce(Car, <Subaru, Honda, Ford>)
reduce(Truck, <Dodge, Chevy>)
Reference: Mapper output.collect()?
Question No : 22
Assuming the following Hive query executes successfully:
Which one of the following statements describes the
result set?
A. A bigram of the top 80 sentences that contain the substring "you are" in the lines column of the inputdata table.
B. An 80-value ngram of sentences that contain the words "you" or "are" in the lines column of the inputdata table.
C. A trigram of the top 80 sentences that contain "you are" followed by a null space in the lines column of the inputdata table.
D. A frequency distribution of the top 80 words that follow the subsequence "you are" in the lines column of the inputdata table.
Answer: D
Question No : 23
For each input key-value pair, mappers can emit:
A. As many intermediate
key-value pairs as designed. There are no restrictions on the
types of those key-value pairs (i.e., they can be
heterogeneous).
B. As many intermediate
key-value pairs as designed, but they cannot be of the same type
as the input key-value pair.
C. One intermediate
key-value pair, of a different type.
D. One intermediate
key-value pair, but of the same type.
E. As many intermediate
key-value pairs as designed, as long as all the keys have the
same types and all the values have the same type.
Answer: E
Explanation: Mapper
maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input
records into intermediate records. The
transformed intermediate records do not need to be of the
same type as the input records.
A given input pair may map to zero or many output pairs.
Reference: Hadoop Map-Reduce Tutorial
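For example, a word-count style mapper emits zero or more (Text, IntWritable) pairs per input line; a minimal sketch (the class name is illustrative):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One input record can produce zero, one, or many intermediate pairs,
// but every emitted key is a Text and every emitted value an IntWritable.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}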
Question No : 24
In Hadoop 2.2, which one of the following statements is
true about a standby NameNode?
The Standby NameNode:
A. Communicates directly
with the active NameNode to maintain the state of the active
NameNode.
B. Receives the same
block reports as the active NameNode.
C. Runs on the same
machine and shares the memory of the active NameNode.
D. Processes all client
requests and block reports from the appropriate DataNodes.
Answer: B
Question No : 25
Which HDFS command copies an HDFS file named foo to the
local filesystem as localFoo?
A. hadoop fs -get foo
LocalFoo
B. hadoop -cp foo
LocalFoo
C. hadoop fs -ls foo
D. hadoop fs -put foo
LocalFoo
Answer: A
Question No : 26
Given the following Hive command:
Which one of the following statements is true?
A. The files in the mydata folder are copied to a subfolder of /apps/hive/warehouse
B. The files in the mydata folder are moved to a subfolder of /apps/hive/warehouse
C. The files in the mydata folder are copied into Hive's underlying relational database
D. The files in the mydata folder do not move from their current location in HDFS
Answer: D
Question No : 27
You want to run Hadoop jobs on your development workstation
for testing before you
submit them to your production cluster. Which mode of
operation in Hadoop allows you to
most closely simulate a production cluster while using a
single machine?
A. Run all the nodes in
your production cluster as virtual machines on your development
workstation.
B. Run the hadoop command with the -jt local and -fs file:/// options.
C. Run the DataNode,
TaskTracker, NameNode and JobTracker daemons on a single
machine.
D. Run simldooop, the
Apache open-source software for simulating Hadoop clusters.
Answer: C
Question No : 28
To use a Java user-defined function (UDF) with Pig, what must you do?
A. Define an alias to
shorten the function name
B. Pass arguments to the constructor of the UDF's implementation class
C. Register the JAR file
containing the UDF
D. Put the JAR file into
the user's home folder in HDFS
Answer: C
Question No : 29
Which two of the following statements are true about
Pig's approach toward data? Choose
2 answers
A. Accepts only data
that has a key/value pair structure
B. Accepts data whether
it has metadata or not
C. Accepts only data
that is defined by metadata tables stored in a database
D. Accepts tab-delimited
text data only
E. Accepts any data:
structured or unstructured
Answer: B,E
Question No : 30
Your cluster's HDFS block size is 64 MB. You have a directory
containing 100 plain text files,
each of which is 100MB in size. The InputFormat for your
job is TextInputFormat.
Determine how many Mappers will run?
A. 64
B. 100
C. 200
D. 640
Answer: C
Explanation: Each file
would be split into two as the block size (64 MB) is less than the file
size (100 MB), so 200 mappers would be running.
Note:
If you're not compressing the files then hadoop will
process your large files (say 10G), with
a number of mappers related to the block size of the
file.
Say your block size is 64M, then you will have ~160
mappers processing this 10G file
(160*64 ~= 10G). Depending on how CPU intensive your
mapper logic is, this might be an
acceptable blocks size, but if you find that your mappers
are executing in sub minute times,
then you might want to increase the work done by each
mapper (by increasing the block
size to 128, 256, 512m - the actual size depends on how
you intend to process the data).
Reference: http://stackoverflow.com/questions/11014493/hadoop-mapreduce-appropriateinput-files-size
(first answer, second paragraph)
Question No : 31
You need to perform statistical analysis in your
MapReduce job and would like to call
methods in the Apache Commons Math library, which is
distributed as a 1.3 megabyte
Java archive (JAR) file. Which is the best way to make
this library available to your
MapReduce job at runtime?
A. Have your system
administrator copy the JAR to all nodes in the cluster and set its
location in the HADOOP_CLASSPATH environment variable
before you submit your job.
B. Have your system administrator
place the JAR file on a Web server accessible to all
cluster nodes and then set the HTTP_JAR_URL environment
variable to its location.
C. When submitting the
job on the command line, specify the –libjars option followed by the
JAR file path.
D. Package your code and
the Apache Commons Math library into a zip file named
JobJar.zip
Answer: C
Explanation: The usage
of the jar command is like this,
Usage: hadoop jar <jar> [mainClass] args...
If you want the commons-math3.jar to be available for all
the tasks you can do any one of
these
1. Copy the jar file in $HADOOP_HOME/lib dir
or
2. Use the generic option -libjars.
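For example, a submission might look like the following (the jar name, driver class and paths are illustrative); note that the generic -libjars option is picked up only when the driver parses generic options, e.g. via ToolRunner:
hadoop jar myjob.jar com.example.AnalysisDriver -libjars /path/to/commons-math3.jar /input /output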
Question No : 32
Which Hadoop component is responsible for managing the
distributed file system
metadata?
A. NameNode
B. Metanode
C. DataNode
D. NameSpaceManager
Answer: A
Question No : 33
Which one of the following statements describes the
relationship between the
NodeManager and the ApplicationMaster?
A. The ApplicationMaster
starts the NodeManager in a Container
B. The NodeManager
requests resources from the ApplicationMaster
C. The ApplicationMaster
starts the NodeManager outside of a Container
D. The NodeManager
creates an instance of the ApplicationMaster
Answer: D
Question No : 34
Given a directory of files with the following structure:
line number, tab character, string:
Example:
1	abialkjfjkaoasdfjksdlkjhqweroij
2	kadfjhuwqounahagtnbvaswslmnbfgy
3	kjfteiomndscxeqalkzhtopedkfsikj
You want to send each line as one record to your Mapper.
Which InputFormat should you
use to complete the line: conf.setInputFormat
(____.class) ; ?
A. SequenceFileAsTextInputFormat
B. SequenceFileInputFormat
C. KeyValueTextInputFormat
D. BDBInputFormat
Answer: C
Explanation:
http://stackoverflow.com/questions/9721754/how-to-parse-customwritable-from-text-inhadoop
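For reference, a sketch of the relevant configuration in the older JobConf API implied by conf.setInputFormat (the input path is illustrative):
// KeyValueTextInputFormat splits each line at the first tab character,
// so the line number becomes the key and the rest of the line the value;
// each line therefore reaches the Mapper as one record.
conf.setInputFormat(KeyValueTextInputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("/user/hadoop/lines"));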