~bigdata-dev/charms/trusty/hdp-tez/trunk : revision 3


# Overview

Data warehouse infrastructure built on top of Hortonworks Apache Hive.

Apache Tez, a framework for YARN-based data processing applications in Hadoop.

Hortonworks Apache Hive 0.12.x is a data warehouse infrastructure built
on top of Hortonworks Hadoop 2.4.1 that
provides tools to enable easy data summarization, ad hoc querying and
analysis of large datasets stored in Hadoop files. It provides a
mechanism to put structure on this data and it also provides a simple
query language called HiveQL, which is based on SQL and which enables
users familiar with SQL to query this data. At the same time, this
language also allows traditional map/reduce programmers to
plug in their custom mappers and reducers to do more sophisticated
analysis which may not be supported by the built-in capabilities of
the language.

Hive provides:

- HiveQL - An SQL dialect for querying data in an RDBMS fashion
- UDF/UDAF/UDTF (User Defined [Aggregate/Table] Functions) - Allows users to
  create custom Map/Reduce based functions for regular use
- Ability to do joins (inner/outer/semi) between tables
- Support (limited) for sub-queries
- Support for table 'Views'
- Ability to partition data into Hive partitions or buckets to enable faster
  querying
- Hive Web Interface - A web interface to Hive
- Hive Server2 - Supports multi-user querying using Thrift, JDBC and ODBC clients
- Hive Metastore - Ability to run a separate Metadata storage process
- Hive CLI - A Hive command line that supports HiveQL

See [http://hive.apache.org](http://hive.apache.org) for more information.

This charm provides the Hive Server and Metastore roles which form part of an
overall Hive deployment.

Apache™ Tez is an extensible framework for building YARN-based, high-performance batch and interactive data processing applications in Hadoop that need to handle TB to PB scale datasets. It allows projects in the Hadoop ecosystem, such as Apache Hive and Apache Pig, as well as third-party software vendors, to express fit-to-purpose data processing applications in a way that meets their unique demands for fast response times and extreme throughput at petabyte scale.


Why Apache Tez

Apache Tez provides a developer API and framework to write native YARN applications that bridge the spectrum of interactive and batch workloads. It allows applications to scale from GBs to PBs of data and from tens to thousands of nodes. The Apache Tez component library allows developers to use Tez to create Hadoop applications that integrate with YARN and perform well within mixed workload Hadoop clusters.


And, since Tez is extensible and embeddable, it provides the fit-to-purpose freedom to express highly optimized data processing applications, giving them an advantage over general-purpose, end-user-facing engines such as MapReduce and Spark. Finally, it offers a customizable execution architecture that allows you to express complex computations as dataflow graphs and allows for dynamic performance optimizations based on real information about the data and the resources required to process it.


# Usage

A Hive deployment consists of a Hive service, an RDBMS (only MySQL is currently
supported), an optional Metastore service and a Hadoop cluster.

Verify that your cluster meets the following prerequisites before installing Tez:

- Apache Hadoop 2.4.x and YARN

To deploy a simple four node Hadoop cluster (see Hadoop charm README for further
information)::


**To deploy a four node Hadoop cluster**

    juju deploy hdp-hadoop yarn-hdfs-master
    juju deploy hdp-hadoop compute-node
    juju add-unit -n 2 yarn-hdfs-master
    juju add-unit -n 2 compute-node
    juju add-relation yarn-hdfs-master:namenode compute-node:datanode
    juju add-relation yarn-hdfs-master:resourcemanager compute-node:nodemanager

A Hive server stores metadata in MySQL::

    juju deploy mysql
    # hive requires ROW binlog
    juju set mysql binlog-format=ROW

To deploy a Hive service without a Metastore service::

    # deploy Hive instance (hive-server2)
    juju deploy hdp-hive hdphive
    # associate Hive with MySQL
    juju add-relation hdphive:db mysql:db

    # associate Hive with HDFS Namenode
    juju add-relation hdphive:namenode yarn-hdfs-master:namenode
    # associate Hive with resourcemanager
    juju add-relation hdphive:resourcemanager yarn-hdfs-master:resourcemanager

**To deploy a Tez client**

    juju deploy hdp-tez hdp-tez
    juju add-relation hdp-tez:resourcemanager yarn-hdfs-master:resourcemanager
    juju add-relation hdp-tez:namenode yarn-hdfs-master:namenode
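The Tez client steps above can be collected into a small script. This is a minimal sketch, not part of the charm: `JUJU`, `TEZ_SERVICE`, `MASTER` and `deploy_tez` are illustrative names, and the service name `hdp-tez` is an assumption taken from the `hdp-tez/0` unit used in the verification commands. By default it only prints the commands it would run; exporting `JUJU=juju` makes it execute them.

```shell
#!/bin/sh
# Dry run by default: each command is echoed instead of executed.
# Export JUJU=juju to run the commands for real.
JUJU="${JUJU:-echo juju}"

# Assumed service names (illustrative; adjust to your environment).
TEZ_SERVICE="hdp-tez"
MASTER="yarn-hdfs-master"

# Deploy the Tez client and relate it to the YARN/HDFS master.
# The && chain stops at the first command that fails.
deploy_tez() {
    $JUJU deploy hdp-tez "${TEZ_SERVICE}" &&
    $JUJU add-relation "${TEZ_SERVICE}:resourcemanager" "${MASTER}:resourcemanager" &&
    $JUJU add-relation "${TEZ_SERVICE}:namenode" "${MASTER}:namenode"
}

deploy_tez
```

Keeping the service names in variables means a differently named master service only needs one change.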


## Scale out Usage


If the charm has any recommendations for running at scale, outline them in examples here. For example if you have a memcached relation that improves performance, mention it here.


## Known Limitations and Issues


This not only helps users but gives people a place to start if they want to help you add features to your charm.

    juju add-unit -n 2 compute-node

## Verify deployment

**Install**

Execute:

>> juju run "sudo su hdfs -c 'hdfs dfs -ls /apps/tez'" --unit hdp-tez/0

Successful result:

    hdfs users ... /apps/tez/conf
    hdfs users ... /apps/tez/lib
    hdfs users ... /apps/tez/tez-api-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-common-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-dag-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-mapreduce-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-mapreduce-examples-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-runtime-internals-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-runtime-library-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-tests-0.4.0.2.1.3.0-563.jar
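Comparing the listing by eye is error-prone, so the check can be scripted. The following is a minimal sketch under stated assumptions: `check_tez_listing` is a hypothetical helper, not something the charm ships. It reads the `hdfs dfs -ls /apps/tez` output on standard input and assumes every entry ends its line with the full path, as in the listing above. Jar names are matched by prefix, so a different Tez build number still passes.

```shell
#!/bin/sh
# check_tez_listing (hypothetical helper): reads an `hdfs dfs -ls /apps/tez`
# listing on stdin, prints "missing: NAME" for every expected entry that is
# absent, and returns 0 only when all expected entries are present.
check_tez_listing() {
    listing=$(cat)
    missing=0
    for name in conf lib tez-api tez-common tez-dag tez-mapreduce \
                tez-mapreduce-examples tez-runtime-internals \
                tez-runtime-library tez-tests; do
        case "$name" in
            # Directories must match the path exactly.
            conf|lib) pattern="/apps/tez/$name\$" ;;
            # Jars: name, a version starting with a digit, then .jar
            *)        pattern="/apps/tez/$name-[0-9].*\\.jar\$" ;;
        esac
        if ! printf '%s\n' "$listing" | grep -q "$pattern"; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

One way to use it is to pipe the command from the install check into the function: `juju run "sudo su hdfs -c 'hdfs dfs -ls /apps/tez'" --unit hdp-tez/0 | check_tez_listing`.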

**HDFS validation from Tez Client**

1) Check the health of the remote HDFS cluster

>> juju run "su hdfs -c 'hdfs dfsadmin -report'" --unit hdp-tez/0

**Validate the returned information**

2) Validate directory creation on the HDFS cluster

>> juju run "su hdfs -c 'hdfs dfs -mkdir /tmp1'" --unit hdp-tez/0

3) Copy a test data file to the HDFS cluster

>> juju run "su hdfs -c 'hdfs dfs -put /home/ubuntu/pg4300.txt /tmp'" --unit hdp-tez/0

4) Run the Tez word-count example

>> juju run "/home/ubuntu/runtez_wc.sh" --unit hdp-tez/0

5) View the result saved on the HDFS cluster

>> juju run "su hdfs -c 'hdfs dfs -cat /tmp/pg4300.out/*'" --unit hdp-tez/0
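The five validation steps can also be run in sequence from one script. This is a minimal sketch rather than a tool provided by the charm: `JUJU`, `UNIT`, `as_hdfs` and `validate_hdfs` are illustrative names, and the paths are copied from the steps above. By default it only prints the commands; exporting `JUJU=juju` runs them against the unit.

```shell
#!/bin/sh
# Dry run by default: each command is echoed instead of executed.
# Export JUJU=juju to run the commands for real.
JUJU="${JUJU:-echo juju}"
UNIT="hdp-tez/0"

# Run one command as the hdfs user on the Tez client unit.
as_hdfs() {
    $JUJU run "su hdfs -c '$1'" --unit "$UNIT"
}

# Steps 1-5 in order; the && chain stops at the first failing step.
validate_hdfs() {
    as_hdfs "hdfs dfsadmin -report" &&
    as_hdfs "hdfs dfs -mkdir /tmp1" &&
    as_hdfs "hdfs dfs -put /home/ubuntu/pg4300.txt /tmp" &&
    $JUJU run "/home/ubuntu/runtez_wc.sh" --unit "$UNIT" &&
    as_hdfs "hdfs dfs -cat /tmp/pg4300.out/*"
}

validate_hdfs
```

Because the steps are chained, a failed copy in step 3 prevents the word-count job from running against missing input.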


# Configuration
