~asanjar/charms/trusty/hdp-hadoop/hpcloud : revision 13

## Overview

**What is Hortonworks Apache Hadoop (HDP 2.1.3)?**

The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using a simple programming model.

It is designed to scale up from single servers to thousands of machines,
each offering local computation and storage. Rather than rely on hardware
to deliver high availability, the library itself is designed to detect
and handle failures at the application layer, thereby delivering a
highly available service on top of a cluster of computers, each of
which may be prone to failures.

Apache Hadoop 2.4.1 includes significant improvements over the previous stable release (hadoop-1.x).

Here is a short overview of the improvements to both HDFS and MapReduce.

- **HDFS Federation**

  In order to scale the name service horizontally, federation uses multiple independent
  Namenodes/Namespaces. The Namenodes are federated; that is, they are independent
  and don't require coordination with each other. The datanodes are used as common storage for
  blocks by all the Namenodes. Each datanode registers with all the Namenodes in the cluster.
  Datanodes send periodic heartbeats and block reports and handle commands from the Namenodes.

  More details are available in the HDFS Federation document:
  <http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/Federation.html>

- **MapReduce NextGen aka YARN aka MRv2**

  The new architecture, introduced in hadoop-0.23, divides the two major functions of the
  JobTracker, resource management and job life-cycle management, into separate components.
  The new ResourceManager manages the global assignment of compute resources to
  applications, and the per-application ApplicationMaster manages the application's
  scheduling and coordination.
  An application is either a single job in the sense of classic MapReduce jobs or a DAG of
  such jobs.

  The ResourceManager and the per-machine NodeManager daemon, which manages the user processes on
  that machine, form the computation fabric.

  The per-application ApplicationMaster is, in effect, a framework-specific library and is
  tasked with negotiating resources from the ResourceManager and working with the NodeManager(s)
  to execute and monitor the tasks.

  More details are available in the YARN document:
  <http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YARN.html>

## Usage

This charm supports the following Hadoop roles:

- HDFS: namenode, secondarynamenode and datanode (HDFS Federation is TBD)
- YARN: ResourceManager, NodeManager

This allows Hadoop to be deployed in a number of configurations.

### HDP 2.1.3 Usage #1: Combined HDFS and MapReduce

In this configuration, the YARN ResourceManager is deployed on the same
service units as the HDFS namenode, and the HDFS datanodes also run the YARN NodeManager::

    juju deploy hdp-hadoop yarn-hdfs-master
    juju deploy hdp-hadoop compute-node
    juju add-unit -n 2 compute-node
    juju add-relation yarn-hdfs-master:namenode compute-node:datanode
    juju add-relation yarn-hdfs-master:resourcemanager compute-node:nodemanager
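The steps for this combined configuration can be collected into a small script. The following is a minimal sketch rather than part of the charm: `run`, `deploy_combined`, and the `DRY_RUN` variable are hypothetical names introduced for illustration, and the sketch assumes the two extra units are meant for `compute-node`. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: run, deploy_combined and DRY_RUN are hypothetical helpers,
# not something the hdp-hadoop charm provides.

# Print the command when DRY_RUN=1 (the default here); otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Deploy the combined HDFS + YARN layout described above.
# $1: number of extra compute-node units (assumed default: 2)
deploy_combined() {
    units="${1:-2}"
    run juju deploy hdp-hadoop yarn-hdfs-master
    run juju deploy hdp-hadoop compute-node
    run juju add-unit -n "$units" compute-node
    run juju add-relation yarn-hdfs-master:namenode compute-node:datanode
    run juju add-relation yarn-hdfs-master:resourcemanager compute-node:nodemanager
}

deploy_combined 2
```

Setting `DRY_RUN=0` in the environment would make the script execute the commands against the current Juju environment instead of printing them.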

### HDP 2.1.3 Usage #2: Separate HDFS and MapReduce

TBD by 9/30/2014.

In this configuration the HDFS and YARN deployments operate on
different service units as separate services::

    juju deploy hdp-hadoop resourcemanager
    juju deploy hdp-hadoop namenode
    juju add-unit -n 2 compute-node
    juju add-relation namenode:namenode hdfs-datacluster:datanode

    juju deploy hadoop mapred-resourcemanager
    juju deploy hadoop mapred-taskcluster
    juju add-unit -n 2 mapred-taskcluster
    juju add-relation mapred-resourcemanager:mapred-namenode hdfs-namenode:namenode
    juju add-relation mapred-taskcluster:mapred-namenode hdfs-namenode:namenode
    juju add-relation mapred-resourcemanager:resourcemanager mapred-taskcluster:nodemanager

In the long term, Juju should support improved placement of services to
better support this type of deployment. This would allow MapReduce services
to be deployed onto machines with more processing power and HDFS services
to be deployed onto machines with larger storage.

### To deploy a Hadoop service with an Elasticsearch service

    # TBD by 10/15/2014

    # deploy Elasticsearch locally:
    juju deploy elasticsearch elasticsearch
    # the elasticsearch-hadoop.jar file will be added to the LIBJARS path
    # Recommended: use the hadoop -libjars option to include the elk jar file
    juju add-unit -n elasticsearch
    # deploy the hive service using any of the scenarios mentioned above
    # associate Hive with elasticsearch
    juju add-relation hadoop-master:elasticsearch elasticsearch:client

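The `-libjars` recommendation above can be illustrated with a small helper that assembles a `hadoop jar` command line. This is a sketch under stated assumptions: `hadoop_with_es`, the `ES_HADOOP_JAR` variable, the jar path, and the example job jar and driver class are all hypothetical placeholders. It also assumes the job's driver parses Hadoop's generic options (for example through `ToolRunner`), since `-libjars` is only honoured in that case and has to appear before the job's own arguments.

```shell
# Sketch only: hadoop_with_es and ES_HADOOP_JAR are hypothetical names, and
# /path/to/elasticsearch-hadoop.jar is a placeholder, not a path the charm sets.

# Print a `hadoop jar` command that ships the connector jar via -libjars.
# $1: job jar, $2: driver class, remaining arguments: passed to the job.
hadoop_with_es() {
    job_jar="$1"
    main_class="$2"
    shift 2
    es_jar="${ES_HADOOP_JAR:-/path/to/elasticsearch-hadoop.jar}"
    # Generic options such as -libjars go after the driver class
    # and before the job's own arguments.
    echo hadoop jar "$job_jar" "$main_class" -libjars "$es_jar" "$@"
}

# Example with placeholder names; only prints the command for review.
hadoop_with_es wordcount.jar com.example.WordCount input output
```

The helper prints the command instead of running it, so the output can be inspected or piped to `sh` once the placeholder paths have been replaced with real ones.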

## Known Limitations and Issues

Note that removing the relation between namenode and datanode is destructive!
The role of the service is determined at the point that the relation is added
(it must be qualified) and CANNOT be changed later!

A single hdfs-master can support multiple slave service deployments::

    juju deploy hadoop hdfs-datacluster-02
    juju add-unit -n 2 hdfs-datacluster-02
    juju add-relation hdfs-namenode:namenode hdfs-datacluster-02:datanode

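When several slave services are attached to one namenode, the three commands above repeat with only the service name changing. A minimal sketch of generating them follows; `add_dataclusters` and its arguments are hypothetical, and the `hdfs-datacluster-NN` naming simply extends the example above. It prints the commands rather than running them.

```shell
# Sketch only: add_dataclusters is a hypothetical helper for illustration.

# Print the juju commands for datanode services numbered $1 through $2.
# $3: extra units per service (assumed default: 2)
add_dataclusters() {
    first="$1"
    last="$2"
    units="${3:-2}"
    i="$first"
    while [ "$i" -le "$last" ]; do
        # Zero-pad the index to match names such as hdfs-datacluster-02.
        name="$(printf 'hdfs-datacluster-%02d' "$i")"
        echo "juju deploy hadoop $name"
        echo "juju add-unit -n $units $name"
        echo "juju add-relation hdfs-namenode:namenode $name:datanode"
        i=$((i + 1))
    done
}

add_dataclusters 2 3
```

Because each relation fixes the role of the service it is added to, reviewing the printed commands before running them is the safer order of operations.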

## Contact Information

Amir Sanjar <amir.sanjar@canonical.com>

## Hadoop

- [Apache Hadoop](http://hadoop.apache.org/) home page
- [Apache Hadoop bug trackers](http://hadoop.apache.org/issue_tracking.html)
- [Apache Hadoop mailing lists](http://hadoop.apache.org/mailing_lists.html)
- [Apache Hadoop Juju Charm](http://jujucharms.com/?text=hadoop)