Data warehouse infrastructure built on top of Hortonworks Apache Hive.

Apache Tez, a framework for YARN-based data processing applications in Hadoop.

Hortonworks Apache Hive 0.12.x is a data warehouse infrastructure built
on top of Hortonworks Hadoop 2.4.1 that provides tools to enable easy
data summarization, ad hoc querying and analysis of large datasets
stored in Hadoop files. It provides a mechanism to put structure on
this data and a simple query language called HiveQL, which is based on
SQL and enables users familiar with SQL to query this data. At the same
time, this language also allows traditional map/reduce programmers to
plug in their custom mappers and reducers to do more sophisticated
analysis which may not be supported by the built-in capabilities of the
language.

Hive provides:

- HiveQL - An SQL dialect language for querying data in an RDBMS fashion
- UDF/UDAF/UDTF (User Defined [Aggregate/Table] Functions) - Allows users to
  create custom Map/Reduce based functions for regular use
- Ability to do joins (inner/outer/semi) between tables
- Support (limited) for sub-queries
- Support for table 'Views'
- Ability to partition data into Hive partitions or buckets to enable faster
  querying
- Hive Web Interface - A web interface to Hive
- Hive Server2 - Supports multi-user querying using Thrift, JDBC and ODBC clients
- Hive Metastore - Ability to run a separate metadata storage process
- Hive CLI - A Hive command line client that supports HiveQL

See [http://hive.apache.org](http://hive.apache.org) for more information.
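
As an illustration of the HiveQL layer described above, here is a minimal,
hypothetical session using the Hive CLI on a deployed unit; the table name,
columns and data format are made up for the example:

    # define a simple table over tab-separated data and run an aggregate query
    hive -e "CREATE TABLE IF NOT EXISTS web_logs (ts STRING, level STRING, msg STRING)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"
    hive -e "SELECT level, COUNT(*) AS hits FROM web_logs GROUP BY level;"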

This charm provides the Hive Server and Metastore roles which form part of an
overall Hive deployment.

Apache Tez is an extensible framework for building YARN-based, high performance
batch and interactive data processing applications in Hadoop that need to handle
TB to PB scale datasets. It allows projects in the Hadoop ecosystem, such as
Apache Hive and Apache Pig, as well as third-party software vendors, to express
fit-to-purpose data processing applications in a way that meets their unique
demands for fast response times and extreme throughput at petabyte scale.

Apache Tez provides a developer API and framework to write native YARN
applications that bridge the spectrum of interactive and batch workloads. It
allows applications to seamlessly span the scalability dimension from GBs to
PBs of data and tens to thousands of nodes. The Apache Tez component library
allows developers to create Hadoop applications that integrate with YARN and
perform well within mixed workload Hadoop clusters.

Since Tez is extensible and embeddable, it provides the fit-to-purpose freedom
to express highly optimized data processing applications, giving them an
advantage over general-purpose, end-user-facing engines such as MapReduce and
Spark. It also offers a customizable execution architecture that allows you to
express complex computations as dataflow graphs and allows for dynamic
performance optimizations based on real information about the data and the
resources required to process it.

A Hive deployment consists of a Hive service, an RDBMS (only MySQL is currently
supported), an optional Metastore service and a Hadoop cluster.

Verify that your cluster meets the following prerequisites before installing Tez:

- Apache Hadoop 2.4.x & YARN
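
If a Hadoop cluster is already running, a quick way to confirm the Hadoop
prerequisite is to print the version from one of its units. This uses the same
juju run pattern as the validation steps later in this README; the unit name is
illustrative:

    # print the Hadoop version on an existing master unit
    juju run "hadoop version" --unit yarn-hdfs-master/0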

To deploy a simple four node Hadoop cluster (see the Hadoop charm README for
further details)::

    juju deploy hdp-hadoop yarn-hdfs-master
    juju deploy hdp-hadoop compute-node
    juju add-unit -n 2 compute-node
    juju add-relation yarn-hdfs-master:namenode compute-node:datanode
    juju add-relation yarn-hdfs-master:resourcemanager compute-node:nodemanager
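
Once the units report as started, the cluster can be sanity-checked from the
master unit. This mirrors the dfsadmin report used in the Tez validation steps
further down and assumes the hdfs user and CLI are present on the master unit,
as they are on the Tez client below:

    # ask the namenode for a cluster-wide HDFS report
    juju run "su hdfs -c 'hdfs dfsadmin -report'" --unit yarn-hdfs-master/0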

A Hive server stores metadata in MySQL::

    juju deploy mysql
    # hive requires ROW binlog
    juju set mysql binlog-format=ROW
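
To confirm the setting took effect, the service configuration can be inspected
with the standard Juju 1.x CLI (no charm-specific commands are assumed):

    # show the current binlog-format value for the mysql service
    juju get mysql | grep -A 4 binlog-format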

To deploy a Hive service without a Metastore service::

    # deploy Hive instance (hive-server2)
    juju deploy hdp-hive hdphive
    # associate Hive with MySQL
    juju add-relation hdphive:db mysql:db
    # associate Hive with HDFS Namenode
    juju add-relation hdphive:namenode yarn-hdfs-master:namenode
    # associate Hive with resourcemanager
    juju add-relation hdphive:resourcemanager yarn-hdfs-master:resourcemanager
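
As a quick, hypothetical smoke test of the deployed HiveServer2, assuming
beeline is on the unit's PATH and HiveServer2 is listening on its default port
(10000):

    # connect to HiveServer2 via JDBC from the hive unit and list databases
    juju run "beeline -u jdbc:hive2://localhost:10000 -e 'show databases;'" --unit hdphive/0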

**To deploy a Tez Client**

    juju deploy hdp-tez hdp-tez$1
    juju add-relation hdp-tez$1:resourcemanager yarn-hdfs-master:resourcemanager
    juju add-relation hdp-tez$1:namenode yarn-hdfs-master:namenode
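
Once the relations are added, wait for the deployment to settle before running
the validation steps below; a plain status check with the standard Juju CLI is
enough:

    # wait until the hdp-tez unit and its relations report a started state
    juju status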

## Scale out Usage

To add more compute capacity to the Hadoop cluster, add further compute-node
units::

    juju add-unit -n 2 compute-node

## Known Limitations and Issues

- Only MySQL is currently supported as the RDBMS for Hive metadata.

**Verify the Tez installation on HDFS**

The charm installs the Tez libraries into HDFS; list them to confirm:

    >> juju run "sudo su hdfs -c 'hdfs dfs -ls /apps/tez'" --unit hdp-tez/0
    hdfs users ... /apps/tez/conf
    hdfs users ... /apps/tez/lib
    hdfs users ... /apps/tez/tez-api-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-common-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-dag-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-mapreduce-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-mapreduce-examples-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-runtime-internals-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-runtime-library-0.4.0.2.1.3.0-563.jar
    hdfs users ... /apps/tez/tez-tests-0.4.0.2.1.3.0-563.jar

**HDFS validation from the Tez Client**

1) Check the health of the remote HDFS cluster:

    >> juju run "su hdfs -c 'hdfs dfsadmin -report'" --unit hdp-tez/0

   Validate the information returned in the report.

2) Create a test directory on the HDFS cluster:

    >> juju run "su hdfs -c 'hdfs dfs -mkdir /tmp1'" --unit hdp-tez/0

3) Copy a test data file to the HDFS cluster:

    >> juju run "su hdfs -c 'hdfs dfs -put /home/ubuntu/pg4300.txt /tmp'" --unit hdp-tez/0

4) Run the Tez word-count example (a sketch of the driver script follows these
   steps):

    >> juju run "/home/ubuntu/runtez_wc.sh" --unit hdp-tez/0

5) View the result saved on the HDFS cluster:

    >> juju run "su hdfs -c 'hdfs dfs -cat /tmp/pg4300.out/*'" --unit hdp-tez/0
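
The contents of runtez_wc.sh are not shown in this README; the following is a
minimal sketch of what such a driver script could contain. The local jar path
(/usr/lib/tez) and the orderedwordcount example name are assumptions, while the
HDFS input and output paths match steps 3 and 5 above:

    #!/bin/bash
    # run the Tez word-count example as the hdfs user against the file staged in step 3
    su hdfs -c "hadoop jar /usr/lib/tez/tez-mapreduce-examples-0.4.0.2.1.3.0-563.jar \
        orderedwordcount /tmp/pg4300.txt /tmp/pg4300.out"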