Sunday 26 April 2015

Working of Hive

The following diagram depicts the workflow between Hive and Hadoop.
How Hive Works
The following steps describe how Hive interacts with the Hadoop framework:

Step 1: Execute Query
The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC or ODBC) to execute.

Step 2: Get Plan
The driver takes the help of the query compiler, which parses the query to check the syntax and to build the query plan (the requirements of the query).

Step 3: Get Metadata
The compiler sends a metadata request to the Metastore (any database).

Step 4: Send Metadata
The Metastore sends the metadata back to the compiler as a response.

Step 5: Send Plan
The compiler checks the requirement and resends the plan to the driver. At this point, the parsing and compiling of the query is complete.

Step 6: Execute Plan
The driver sends the execution plan to the execution engine.

Step 7: Execute Job
Internally, executing the plan is a MapReduce job. The execution engine sends the job to the JobTracker, which runs on the Name node, and the JobTracker assigns the job to TaskTrackers, which run on the Data nodes. Here, the query runs as a MapReduce job.

Step 7.1: Metadata Ops
Meanwhile, during execution, the execution engine can perform metadata operations against the Metastore.

Step 8: Fetch Result
The execution engine receives the results from the Data nodes.

Step 9: Send Results
The execution engine sends those results to the driver.

Step 10: Send Results
The driver sends the results to the Hive interfaces.
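For example (a minimal sketch, assuming a hypothetical table named employees with a department column), you can watch steps 2 to 6 in isolation by asking Hive for the plan with EXPLAIN, and then run the same query to trigger the MapReduce job of step 7:

-- steps 2-6: the compiler parses the query, consults the Metastore, and returns the plan to the driver
EXPLAIN SELECT department, count(*) FROM employees GROUP BY department;

-- step 7: running the query itself submits it to the execution engine as a MapReduce job
SELECT department, count(*) FROM employees GROUP BY department;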

Saturday 25 April 2015

How to Improve Hive Performance

Use Hive on Tez:

set hive.execution.engine=tez;
With the above setting, every Hive query you execute will take advantage of Tez.
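You can also check which engine is currently active, or fall back to classic MapReduce, from the same session. A minimal sketch (hive.execution.engine accepts mr and tez):

-- print the current value of the setting
set hive.execution.engine;
-- switch this session back to classic MapReduce
set hive.execution.engine=mr;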

Use ORCFile:

Using ORCFile for every Hive table should really be a no-brainer, and it is extremely beneficial for getting fast response times for your Hive queries.

As an example, consider two large tables A and B (both stored as text files, with only some of their columns shown here), and a simple query like:

SELECT A.customerID, A.name, A.age, A.address,
       B.role, B.department, B.salary
FROM A JOIN B
ON A.customerID = B.customerID;


This query may take a long time to execute since tables A and B are both stored as TEXT. Converting these tables to ORCFile format will usually reduce query time significantly:

CREATE TABLE A_ORC (
customerID int, name string, age int, address string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");

INSERT INTO TABLE A_ORC SELECT * FROM A;

CREATE TABLE B_ORC (
customerID int, role string, salary float, department string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");

INSERT INTO TABLE B_ORC SELECT * FROM B;

SELECT A_ORC.customerID, A_ORC.name,
       A_ORC.age, A_ORC.address,
       B_ORC.role, B_ORC.department, B_ORC.salary
FROM A_ORC JOIN B_ORC
ON A_ORC.customerID = B_ORC.customerID;



ORC supports compressed storage (with ZLIB or as shown above with SNAPPY) but also uncompressed storage.
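For example, the same table definition could use ZLIB compression, or no compression at all. A sketch reusing the A_ORC column list from above (the table names here are made up):

CREATE TABLE A_ORC_ZLIB (
customerID int, name string, age int, address string
) STORED AS ORC tblproperties ("orc.compress" = "ZLIB");

CREATE TABLE A_ORC_UNCOMPRESSED (
customerID int, name string, age int, address string
) STORED AS ORC tblproperties ("orc.compress" = "NONE");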

Use Vectorization:

Vectorized query execution improves the performance of operations like scans, aggregations, filters and joins by performing them in batches of 1024 rows at once instead of one row at a time.

Introduced in Hive 0.13, this feature significantly improves query execution time, and is easily enabled with two parameter settings:


set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
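Note that in this Hive release vectorized execution applies to tables stored in the ORC format. As a rough sketch, with the two settings above in place, an aggregation over the A_ORC table created earlier would be processed in batches:

SELECT address, count(*), avg(age)
FROM A_ORC
GROUP BY address;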

Cost based query optimization:

Hive optimizes each query’s logical and physical execution plan before submitting for final execution. These optimizations are not based on the cost of the query – that is, until now.

A recent addition to Hive, Cost-based optimization, performs further optimizations based on query cost, resulting in potentially different decisions: how to order joins, which type of join to perform, degree of parallelism and others.

To use cost-based optimization (also known as CBO), set the following parameters at the beginning of your query:

set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

Then, prepare the data for CBO by running Hive’s “analyze” command to collect various statistics on the tables for which we want to use CBO.

For example, for a table tweets we want to collect statistics about the table and about two columns, “sender” and “topic”:

analyze table tweets compute statistics;
analyze table tweets compute statistics for columns sender, topic;

With Hive 0.14 (on HDP 2.2) the analyze command works much faster, and you don’t need to specify each column, so you can just issue:

analyze table tweets compute statistics for columns;

That’s it. Executing a query that uses this table should now result in a different, faster execution plan, thanks to the cost calculation Hive performs.
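One way to see the difference (a sketch using the tweets table above) is to compare the plans Hive produces with CBO switched off and on:

set hive.cbo.enable=false;
explain select sender, count(*) from tweets group by sender;

set hive.cbo.enable=true;
explain select sender, count(*) from tweets group by sender;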


Pig Interview Questions 2015

Pig Interview questions

1. How does Pig calculate the number of reducers for a job?
2. How to change the default reducer calculation method in pig
3. What is hash based aggregation
4. Performance optimizations in pig
5. Types of joins in pig
6. Explain about merge-sparse join and when do we use it.
7. What are the conditions to use merge joins in Pig
8. How to implement customized loader
9. Can we store the data in partitioned manner? If yes, How?
10. How to load an HBase/Hive table in Pig
11. How to load partitioned hive table in pig
12. Which of the below is more efficient?

           i. A = load 'abc.txt' using PigStorage(",");
              B = Filter A by $0>10 and $1>15;

           ii. A = load 'abc.txt' using PigStorage(",");
               B = Filter A by $0>10;
               C = Filter B by $1>15;

Thursday 23 April 2015

Merge Sort

Mergesort has two phases:
First, we divide the unsorted input set into n subsets of size 1, which are trivially sorted. Second, we combine two or more ordered subsets into a single ordered one. This is called the merge phase.

In the second step, we repeatedly compare the smallest remaining record from one subset with the smallest remaining record from the other input subset. The smaller of the two records (assuming a two-way merge, where only two subsets are merged at a time) is moved to the output. The process repeats until one of the subsets becomes empty, at which point the remaining records of the other subset are appended to the output. For example, merging {2, 5, 9} with {3, 4} outputs 2, 3, 4, 5 and then appends the remaining 9.

The output of a merge step can be an intermediate output, which will be further merged with other intermediate outputs, or the final output, once all n subsets produced in the first phase have been merged into one sorted set of size n.

Hadoop Interview Questions 2015 - Part 2

Explain the built-in sort mechanism used to sort a file with a single column containing a billion records.

What is the default sort mechanism used in MapReduce?

How will 200 MB of data be read by mappers when the block size is 64 MB?

Explain secondary sort, and map-side and reduce-side joins.

How to specify a string as the delimiter in Hive?

Working with Counters: https://www.mapr.com/blog/managing-monitoring-and-testing-mapreduce-jobs-how-work-counters#.VU62SflViko

HDFS Metadata

HDFS metadata represents the structure of HDFS directories and files in a tree. It also includes the various attributes of directories and files, such as ownership, permissions, quotas, and replication factor. In this blog post, I’ll describe how HDFS persists its metadata in Hadoop 2 by exploring the underlying local storage directories and files. All examples shown are from testing a build of the soon-to-be-released Apache Hadoop 2.6.0.
WARNING: Do not attempt to modify metadata directories or files. Unexpected modifications can cause HDFS downtime, or even permanent data loss. This information is provided for educational purposes only.
Persistence of HDFS metadata broadly breaks down into 2 categories of files:
  • fsimage – An fsimage file contains the complete state of the file system at a point in time. Every file system modification is assigned a unique, monotonically increasing transaction ID. An fsimage file represents the file system state after all modifications up to a specific transaction ID.
  • edits – An edits file is a log that lists each file system change (file creation, deletion or modification) that was made after the most recent fsimage.
Checkpointing is the process of merging the content of the most recent fsimage with all edits applied after that fsimage is merged in order to create a new fsimage. Checkpointing is triggered automatically by configuration policies or manually by HDFS administration commands.
NameNode
Here is an example of an HDFS metadata directory taken from a NameNode. This shows the output of running the tree command on the metadata directory, which is configured by setting dfs.namenode.name.dir in hdfs-site.xml.
data/dfs/name
├── current
│ ├── VERSION
│ ├── edits_0000000000000000001-0000000000000000007
│ ├── edits_0000000000000000008-0000000000000000015
│ ├── edits_0000000000000000016-0000000000000000022
│ ├── edits_0000000000000000023-0000000000000000029
│ ├── edits_0000000000000000030-0000000000000000030
│ ├── edits_0000000000000000031-0000000000000000031
│ ├── edits_inprogress_0000000000000000032
│ ├── fsimage_0000000000000000030
│ ├── fsimage_0000000000000000030.md5
│ ├── fsimage_0000000000000000031
│ ├── fsimage_0000000000000000031.md5
│ └── seen_txid
└── in_use.lock
In this example, the same directory has been used for both fsimage and edits. Alternatively, configuration options are available that allow separating fsimage and edits into different directories. Each file within this directory serves a specific purpose in the overall scheme of metadata persistence:
  • VERSION – Text file that contains:
    • layoutVersion – The version of the HDFS metadata format. When we add new features that require changing the metadata format, we change this number. An HDFS upgrade is required when the current HDFS software uses a layout version newer than what is currently tracked here.
    • namespaceID/clusterID/blockpoolID – These are unique identifiers of an HDFS cluster. The identifiers are used to prevent DataNodes from registering accidentally with an incorrect NameNode that is part of a different cluster. These identifiers also are particularly important in a federated deployment. Within a federated deployment, there are multiple NameNodes working independently. Each NameNode serves a unique portion of the namespace (namespaceID) and manages a unique set of blocks (blockpoolID). The clusterID ties the whole cluster together as a single logical unit. It’s the same across all nodes in the cluster.
    • storageType – This is either NAME_NODE or JOURNAL_NODE. Metadata on a JournalNode in an HA deployment is discussed later.
    • cTime – Creation time of file system state. This field is updated during HDFS upgrades.
  • edits_<start transaction ID>-<end transaction ID> – These are finalized (unmodifiable) edit log segments. Each of these files contains all of the edit log transactions in the range defined by the start and end transaction IDs in its file name. In an HA deployment, the standby can only read up through the finalized log segments. It will not be up-to-date with the current edit log in progress (described next). However, when an HA failover happens, the failover finalizes the current log segment so that it’s completely caught up before switching to active.
  • edits_inprogress_<start transaction ID> – This is the current edit log in progress. All transactions starting from that transaction ID are in this file, and all new incoming transactions will get appended to it. HDFS pre-allocates space in this file in 1 MB chunks for efficiency, and then fills it with incoming transactions. You’ll probably see this file’s size as a multiple of 1 MB. When HDFS finalizes the log segment, it truncates the unused portion of the space that doesn’t contain any transactions, so the finalized file’s space will shrink down.
  • fsimage_<end transaction ID> – This contains the complete metadata image up through that transaction ID. Each fsimage file also has a corresponding .md5 file containing an MD5 checksum, which HDFS uses to guard against disk corruption.
  • seen_txid - This contains the last transaction ID of the last checkpoint (merge of edits into a fsimage) or edit log roll (finalization of current edits_inprogress and creation of a new one). Note that this is not the last transaction ID accepted by the NameNode. The file is not updated on every transaction, only on a checkpoint or an edit log roll. The purpose of this file is to try to identify if edits are missing during startup. It’s possible to configure the NameNode to use separate directories for fsimage and edits files. If the edits directory accidentally gets deleted, then all transactions since the last checkpoint would go away, and the NameNode would start up using just fsimage at an old state. To guard against this, NameNode startup also checks seen_txid to verify that it can load transactions at least up through that number. It aborts startup if it can’t.
  • in_use.lock – This is a lock file held by the NameNode process, used to prevent multiple NameNode processes from starting up and concurrently modifying the directory.
JournalNode
In an HA deployment, edits are logged to a separate set of daemons called JournalNodes. A JournalNode’s metadata directory is configured by setting dfs.journalnode.edits.dir. The JournalNode will contain a VERSION file, multiple edits_<start transaction ID>-<end transaction ID> files and an edits_inprogress_<start transaction ID>, just like the NameNode. The JournalNode will not have fsimage files or seen_txid. In addition, it contains several other files relevant to the HA implementation. These files help prevent a split-brain scenario, in which multiple NameNodes could think they are active and all try to write edits.
  • committed-txid – Tracks last transaction ID committed by a NameNode.
  • last-promised-epoch – This file contains the “epoch,” which is a monotonically increasing number. When a new writer (a new NameNode) starts as active, it increments the epoch and presents it in calls to the JournalNode. This scheme is the NameNode’s way of claiming that it is active and that requests from another NameNode presenting a lower epoch must be ignored.
  • last-writer-epoch – Similar to the above, but this contains the epoch number associated with the writer who last actually wrote a transaction. (This was a bug fix for an edge case not handled by last-promised-epoch alone.)
  • paxos – Directory containing temporary files used in implementation of the Paxos distributed consensus protocol. This directory often will appear as empty.
DataNode
Although DataNodes do not contain metadata about the directories and files stored in an HDFS cluster, they do contain a small amount of metadata about the DataNode itself and its relationship to the cluster. This shows the output of running the tree command on the DataNode’s directory, configured by setting dfs.datanode.data.dir in hdfs-site.xml.
data/dfs/data/
├── current
│ ├── BP-1079595417-192.168.2.45-1412613236271
│ │ ├── current
│ │ │ ├── VERSION
│ │ │ ├── finalized
│ │ │ │ └── subdir0
│ │ │ │ └── subdir1
│ │ │ │ ├── blk_1073741825
│ │ │ │ └── blk_1073741825_1001.meta
│ │ │ ├── lazyPersist
│ │ │ └── rbw
│ │ ├── dncp_block_verification.log.curr
│ │ ├── dncp_block_verification.log.prev
│ │ └── tmp
│ └── VERSION
└── in_use.lock
The purpose of these files is:
  • BP-<random integer>-<NameNode IP address>-<creation time> – The naming convention of this directory is significant and constitutes a form of cluster metadata. The name is a block pool ID. “BP” stands for “block pool,” the abstraction that collects a set of blocks belonging to a single namespace. In the case of a federated deployment, there will be multiple “BP” sub-directories, one for each block pool. The remaining components form a unique ID: a random integer, followed by the IP address of the NameNode that created the block pool, followed by the creation time.
  • VERSION – Much like the NameNode and JournalNode, this is a text file containing multiple properties, such as layoutVersion, clusterId and cTime, all discussed earlier. There is a VERSION file tracked for the entire DataNode as well as a separate VERSION file in each block pool sub-directory. In addition to the properties already discussed earlier, the DataNode’s VERSION files also contain:
    • storageType – In this case, the storageType field is set to DATA_NODE.
    • blockpoolID – This repeats the block pool ID information encoded into the sub-directory name.
  • finalized/rbw - Both finalized and rbw contain a directory structure for block storage. This holds numerous block files, which contain HDFS file data and the corresponding .meta files, which contain checksum information. “Rbw” stands for “replica being written”. This area contains blocks that are still being written to by an HDFS client. The finalized sub-directory contains blocks that are not being written to by a client and have been completed.
  • lazyPersist – HDFS is incorporating a new feature to support writing transient data to memory, followed by lazy persistence to disk in the background. If this feature is in use, then a lazyPersist sub-directory is present and used for lazy persistence of in-memory blocks to disk. We’ll cover this exciting new feature in greater detail in a future blog post.
  • dncp_block_verification.log – This file tracks the last time each block was verified by checking its contents against its checksum. The last verification time is significant for deciding how to prioritize subsequent verification work. The DataNode orders its background block verification work in ascending order of last verification time. This file is rolled periodically, so it’s typical to see a .curr file (current) and a .prev file (previous).
  • in_use.lock – This is a lock file held by the DataNode process, used to prevent multiple DataNode processes from starting up and concurrently modifying the directory.
Commands
Various HDFS commands impact the metadata directories:
  • hdfs namenode – NameNode startup automatically saves a new checkpoint. As stated earlier, checkpointing is the process of merging any outstanding edit logs with the latest fsimage, saving the full state to a new fsimage file, and rolling edits. Rolling edits means finalizing the current edits_inprogress and starting a new one.
  • hdfs dfsadmin -safemode enter followed by hdfs dfsadmin -saveNamespace – This saves a new checkpoint (much like restarting the NameNode) while the NameNode process remains running. Note that the NameNode must be in safe mode, so all attempted write activity would fail while this is running.
  • hdfs dfsadmin -rollEdits – This manually rolls edits. Safe mode is not required. This can be useful if a standby NameNode is lagging behind the active and you want it to get caught up more quickly. (The standby NameNode can only read finalized edit log segments, not the current in-progress edits file.)
  • hdfs dfsadmin -fetchImage – Downloads the latest fsimage from the NameNode. This can be helpful for a remote backup type of scenario.
Configuration Properties
Several configuration properties in hdfs-site.xml control the behavior of HDFS metadata directories.
  • dfs.namenode.name.dir – Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
  • dfs.namenode.edits.dir - Determines where on the local filesystem the DFS name node should store the transaction (edits) file. If this is a comma-delimited list of directories then the transaction file is replicated in all of the directories, for redundancy. Default value is same as dfs.namenode.name.dir.
  • dfs.namenode.checkpoint.period - The number of seconds between two periodic checkpoints.
  • dfs.namenode.checkpoint.txns – The standby will create a checkpoint of the namespace every ‘dfs.namenode.checkpoint.txns’ transactions, regardless of whether ‘dfs.namenode.checkpoint.period’ has expired.
  • dfs.namenode.checkpoint.check.period – How frequently to query for the number of uncheckpointed transactions.
  • dfs.namenode.num.checkpoints.retained - The number of image checkpoint files that will be retained in storage directories. All edit logs necessary to recover an up-to-date namespace from the oldest retained checkpoint will also be retained.
  • dfs.namenode.num.extra.edits.retained – The number of extra transactions which should be retained beyond what is minimally necessary for a NN restart. This can be useful for audit purposes or for an HA setup where a remote Standby Node may have been offline for some time and need to have a longer backlog of retained edits in order to start again.
  • dfs.namenode.edit.log.autoroll.multiplier.threshold – Determines when an active namenode will roll its own edit log. The actual threshold (in number of edits) is determined by multiplying this value by dfs.namenode.checkpoint.txns. This prevents extremely large edit files from accumulating on the active namenode, which can cause timeouts during namenode startup and pose an administrative hassle. This behavior is intended as a failsafe for when the standby fails to roll the edit log by the normal checkpoint threshold.
  • dfs.namenode.edit.log.autoroll.check.interval.ms – How often an active namenode will check if it needs to roll its edit log, in milliseconds.
  • dfs.datanode.data.dir – Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored. Heterogeneous storage allows specifying that each directory resides on a different type of storage: DISK, SSD, ARCHIVE or RAM_DISK.
Conclusion
We briefly discussed how HDFS persists its metadata in Hadoop 2 by exploring the underlying local storage directories and files, the relevant configuration properties that drive specific behaviors, and the HDFS commands that print out the directory tree, initiate a checkpoint, and create an fsimage.





Wednesday 22 April 2015

Hadoop Interview Questions 2015 - Part 1



Few more questions.. Happy Reading



1.Explain how Hadoop is different from other parallel computing solutions.

2.What are the modes Hadoop can run in?

3.What is a NameNode and what is a DataNode?

4.What is Shuffling in MapReduce?

5.What is the functionality of Task Tracker and Job Tracker in Hadoop? How many instances of a Task Tracker and Job Tracker can be run on a single Hadoop Cluster?

6.How does NameNode tackle DataNode failures?

7.What is InputFormat in Hadoop?

8.What is the purpose of RecordReader in Hadoop?

9.Why can't we use Java primitive data types in Map Reduce?

10.Explain how do you decide between Managed & External tables in hive

11.Can we change the default location of Managed tables

12.What are the points to consider when moving from an Oracle database to Hadoop clusters? How would you decide the correct size and number of nodes in a Hadoop cluster?

13.If you want to analyze 100TB of data, what is the best architecture for that?

14.What is InputSplit in MapReduce?

15.In Hadoop, if a custom partitioner is not defined, how is data partitioned before it is sent to the reducer?

16.What is replication factor in Hadoop and what is default replication factor level Hadoop comes with?

17.What is SequenceFile in Hadoop and Explain its importance?

18.What is Speculative execution in Hadoop?

19.What are the factors that we consider while creating a hive table

20.What are the compression techniques and how do you decide which one to use

21.Co group in Pig

22.If you are the user of a MapReduce framework, then what are the configuration parameters you need to specify?

23.How do you benchmark your Hadoop Cluster with Hadoop tools?

24.Explain the difference between ORDER BY and SORT BY in Hive?

25.What is WebDAV in Hadoop?

26.How many Daemon processes run on a Hadoop System?

27.Hadoop attains parallelism by distributing tasks across various nodes; it is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What method does Hadoop provide to combat this?

28.How are HDFS blocks replicated?

29.What will a Hadoop job do if developers try to run it with an output directory that is already present?

30.What happens if the number of reducers is 0?

31.What is meant by Map-side and Reduce-side join in Hadoop?

32.How can the NameNode be restarted?

33.How to include partitioned column in data - Hive

34.What does the hadoop fs -put command do exactly?

35.What is the limit on Distributed cache size?

36.Handling skewed data

37.When doing a join in Hadoop, you notice that one reducer is running for a very long time. How will you address this problem in Pig?

38.How can you debug your Hadoop code?

39.What is distributed cache and what are its benefits?

40.Why would a Hadoop developer develop a MapReduce job with the reduce step disabled?

41.Explain the major difference between an HDFS block and an InputSplit.

42.Are there any problems which can only be solved by MapReduce and cannot be solved by PIG? In which kind of scenarios MR jobs will be more useful than PIG?

43.What is the need for having a password-less SSH in a distributed environment?

44.Give an example scenario on the usage of counters.

45.Does HDFS make block boundaries between records?

46.What is streaming access?

47.What do you mean by “Heartbeat” in HDFS?

48.If there are 10 HDFS blocks to be copied from one machine to another. However, the other machine can copy only 7.5 blocks, is there a possibility for the blocks to be broken down during the time of replication?

49.What is the significance of conf.setMapper class?

50.What are combiners and when are these used in a MapReduce job?

51.What are the Different joins in hive?

52.Explain about SMB join in Hive

53.Which command is used to do a file system check in HDFS?

54.Explain about the different parameters of the mapper and reducer functions.

55.How can you set random number of mappers and reducers for a Hadoop job?

56.Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed for any reason? (Open-ended question)

57.Explain about the functioning of Master Slave architecture in Hadoop?

58.What is fault tolerance in HDFS?

59.Give some examples of companies that are using Hadoop architecture extensively.

60.How does a DataNode know the location of the NameNode in Hadoop cluster?

61.How can you check whether the NameNode is working or not?

62.Explain about the different types of “writes” in HDFS.


Hope this helps!

A good reference for Shell scripting: https://linuxcommand.org/lc3_writing_shell_scripts.php