I'm trying to use distcp to copy some files from HDFS to Amazon S3. My Hadoop cluster connects to the internet through an HTTP proxy, but I can't figure out how to specify this when connecting to S3. I'm currently getting the error:
httpclient.HttpMethodDirector: I/O exception (org.apache.commons.httpclient.ConnectTimeoutException) caught when processing request: The host did not accept the connection within timeout of 60000 ms
This indicates that it's trying to connect directly to Amazon. How do I get distcp to use the proxy host?
I'm posting another answer here because this is the first Stack Overflow question that comes up on Google when searching for "hdfs s3 proxy", and in my opinion the existing answer is not the best.
Configuring S3 access for HDFS is best done in the hdfs-site.xml file on each node. This way it works for distcp (to copy from HDFS to S3 and back), but also for Impala and potentially other Hadoop components that can use S3.

So, add the following properties to your hdfs-site.xml:
<property>
  <name>fs.s3a.access.key</name>
  <value>your_access_key</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>your_secret_key</value>
</property>
<property>
  <name>fs.s3a.proxy.host</name>
  <value>your_proxy_host</value>
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>your_proxy_port</value>
</property>
Set these properties in the file /etc/hadoop/conf/jets3t.properties:
httpclient.proxy-host = proxy.domain.com
httpclient.proxy-port = 12345
If this is documented anywhere, I can't find it, but the code that handles it is in the RestS3Service class. You will need this file distributed to all the nodes so that distcp can do a distributed copy.
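The jets3t settings above affect the older s3n filesystem, which is backed by the JetS3t library. A sketch of a distcp run that would pick them up (bucket name and paths are placeholders; the fs.s3n.* credential properties can also live in core-site.xml instead):

```shell
# s3n:// uses JetS3t under the hood, so the proxy settings in
# jets3t.properties on each node take effect for this copy.
hadoop distcp \
  -Dfs.s3n.awsAccessKeyId=your_access_key \
  -Dfs.s3n.awsSecretAccessKey=your_secret_key \
  hdfs:///user/data s3n://your-bucket/data
```

Note that on recent Hadoop versions the s3a filesystem is preferred over s3n, and s3a does not read jets3t.properties; it uses the fs.s3a.proxy.* properties instead.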