Hadoop and Swift integration is a natural continuation of bringing Hadoop and
OpenStack together. Two changes were required to achieve this:

* Hadoop side: https://issues.apache.org/jira/browse/HADOOP-8545

  This patch is not merged yet and is still being developed, so a jar file
  built from the latest version is available from the CDN:
  http://sahara-files.mirantis.com/hadoop-swift/hadoop-swift-latest.jar

* Swift side: https://review.openstack.org/#/c/21015

  This patch is merged into Grizzly. If you want to make it work in Folsom,
  see the instructions in the section below.
If you are still using Folsom you need to follow these steps:

* Go to the proxy server and find the proxy-server.conf file. In the
  ``[pipeline:main]`` section insert a new filter BEFORE the 'authtoken'
  filter. The name of the new filter is not important; you will use it
  only for configuration. E.g. let it be ``${list_endpoints}``:

  .. sourcecode:: cfg

      pipeline = catch_errors healthcheck cache ratelimit swift3 s3token list_endpoints authtoken keystone proxy-server
The next thing you need to do here is to add the description of the new
filter:

.. sourcecode:: cfg

    [filter:list_endpoints]
    use = egg:swift#${list_endpoints}
    # list_endpoints_path = /endpoints/

``list_endpoints_path`` is not mandatory and defaults to "endpoints".
This parameter is used for HTTP request construction. See details below.
* Go to ``entry_points.txt`` in the egg-info. For swift-1.7.4 it may be found
  in ``/usr/lib/python2.7/dist-packages/swift-1.7.4.egg-info/entry_points.txt``.
  Add the following description to the ``[paste.filter_factory]`` section:

  .. sourcecode:: cfg

      ${list_endpoints} = swift.common.middleware.list_endpoints:filter_factory

* And the last step: put `list_endpoints.py <https://review.openstack.org/#/c/21015/7/swift/common/middleware/list_endpoints.py>`_
  into ``/python2.7/dist-packages/swift/common/middleware/``.
Was Swift patched successfully?
----------------------------------

You may check whether the patching succeeded by sending the following HTTP
requests::

    http://${proxy}:8080/endpoints/${account}/${container}/${object}
    http://${proxy}:8080/endpoints/${account}/${container}
    http://${proxy}:8080/endpoints/${account}

You don't need any additional headers or authorization here (see the
previous section: the ${list_endpoints} filter is placed before the
'authtoken' filter). The response will contain the IPs of all Swift nodes
which contain the corresponding object.
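The URL template above can be composed with a small helper. A minimal Python
sketch (the ``endpoints_url`` helper and the sample host/account names are
illustrative assumptions, not part of Swift):

.. sourcecode:: python

    # Sketch: compose the list_endpoints check URLs described above.
    # The helper and all sample names are illustrative, not part of Swift.
    def endpoints_url(proxy, account, container=None, obj=None,
                      endpoints_path="endpoints"):
        """Build a check URL; ``endpoints_path`` mirrors the optional
        ``list_endpoints_path`` filter parameter ("endpoints" by default)."""
        parts = [account]
        if container is not None:
            parts.append(container)
            if obj is not None:
                parts.append(obj)
        return "http://%s:8080/%s/%s" % (proxy, endpoints_path,
                                         "/".join(parts))

    print(endpoints_url("proxy.example.org", "AUTH_test", "cont", "obj"))
    # -> http://proxy.example.org:8080/endpoints/AUTH_test/cont/obj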
You may build the jar file yourself from the latest patch at
https://issues.apache.org/jira/browse/HADOOP-8545, or you may get the latest
prebuilt one from the CDN:
http://sahara-files.mirantis.com/hadoop-swift/hadoop-swift-latest.jar
You need to put this file into the Hadoop libraries directory
(e.g. /usr/lib/share/hadoop/lib) on each job-tracker and task-tracker node
in the cluster. The main step in this section is to configure the
``core-site.xml`` file on each of these nodes.
All of these configs may be overridden by a Hadoop job or set in
``core-site.xml``:

.. sourcecode:: xml

    <property>
        <name>${name} + ${config}</name>
        <value>${value}</value>
        <description>${not mandatory description}</description>
    </property>
There are two types of configs here:

1. General. The ``${name}`` in this case equals ``fs.swift``. Here is the
   list of ``${config}`` values:

   * ``.impl`` - Swift FileSystem implementation. The ``${value}`` is
     ``org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem``
   * ``.connect.timeout`` - timeout for all connections. By default: 15000
   * ``.socket.timeout`` - how long the connection waits for responses from
     servers. By default: 60000
   * ``.connect.retry.count`` - connection retry count for all connections.
     By default: 3
   * ``.connect.throttle.delay`` - delay in millis between bulk operations
     (delete, rename, copy). By default: 0
   * ``.blocksize`` - blocksize for the filesystem. By default: 32Mb
   * ``.partsize`` - the partition size for uploads. By default: 4608*1024Kb
   * ``.requestsize`` - request size for reads in KB. By default: 64Kb
2. Provider-specific. The patch for Hadoop supports different cloud
   providers. The ``${name}`` in this case equals
   ``fs.swift.service.${provider}``. Here is the list of ``${config}``
   values:

   * ``.auth.url`` - authorization URL
   * ``.region`` - Swift region, used when the cloud has more than one Swift
     installation. If the region parameter is not set, the first region from
     the Keystone endpoint list will be chosen. If the region parameter is
     set but not found, an exception will be thrown.
   * ``.location-aware`` - turns on location awareness. False by default
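Putting both types together, an illustrative ``core-site.xml`` fragment for a
provider named ``sahara`` might look like this (the auth URL and all values
below are placeholder assumptions, not taken from a real installation):

.. sourcecode:: xml

    <configuration>
        <!-- General: ${name} = fs.swift -->
        <property>
            <name>fs.swift.impl</name>
            <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
        </property>
        <property>
            <name>fs.swift.connect.timeout</name>
            <value>15000</value>
        </property>
        <!-- Provider-specific: ${name} = fs.swift.service.sahara -->
        <property>
            <name>fs.swift.service.sahara.auth.url</name>
            <!-- Placeholder Keystone endpoint -->
            <value>http://127.0.0.1:5000/v2.0/tokens</value>
        </property>
        <property>
            <name>fs.swift.service.sahara.location-aware</name>
            <value>true</value>
        </property>
    </configuration>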
At this point Swift and Hadoop are ready for use, and all Hadoop configs are
in place. In the example below the provider's name is ``sahara``. Let's copy
one object to another within one Swift container and account, e.g.
/dev/integration/temp to /dev/integration/temp1. We will use distcp for this
purpose: http://hadoop.apache.org/docs/r0.19.0/distcp.html

How do you write a Swift path? In our case it looks as follows:
``swift://integration.sahara/temp``. So the template is:
``swift://${container}.${provider}/${object}``. We don't need to specify the
account because it is automatically determined from the tenant name in the
configs. In effect, account = tenant.
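The path template can be captured in a tiny helper. A Python sketch (the
``swift_path`` name is an illustrative assumption, not part of the Hadoop
patch):

.. sourcecode:: python

    # Sketch: compose a Hadoop Swift path from the template
    # swift://${container}.${provider}/${object}
    def swift_path(container, provider, obj):
        return "swift://%s.%s/%s" % (container, provider, obj)

    print(swift_path("integration", "sahara", "temp"))
    # -> swift://integration.sahara/temp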
.. sourcecode:: console

    $ hadoop distcp -D fs.swift.service.sahara.username=admin \
    -D fs.swift.service.sahara.password=swordfish \
    swift://integration.sahara/temp swift://integration.sahara/temp1

After that just check whether temp1 has been created.

**Note:** the container name must be a valid URI.