Spark streaming connecting to S3 gives socket timeout

I'm trying to run a Spark Streaming app from my local machine that connects to an S3 bucket, and I'm running into a SocketTimeoutException. This is the code that reads from the bucket:

// Create the context and bind the s3a:// scheme to the S3A filesystem
val sc: SparkContext = createSparkContext(scName)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

// Monitor the S3 path for new files and print the first lines of each batch
val ssc = new StreamingContext(sc, Seconds(time))
val lines = ssc.textFileStream("s3a://foldername/subfolder/")
lines.print()

This is the error I get:

com.amazonaws.http.AmazonHttpClient executeHelper - Unable to execute HTTP request: connect timed out
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)

I thought it might be due to the proxy, so I ran my spark-submit with the proxy options, like so:

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dhttps.proxyHost=proxyserver.com -Dhttps.proxyPort=9000" \
  --class s3module application.jar 5 5 SampleApp

That still gave me the same error. Perhaps I'm not setting the proxy properly? Is there a way to set it in code, through the SparkContext's configuration?

There are specific options for S3A proxy setup, covered in the Hadoop S3A documentation:

<property>
  <name>fs.s3a.proxy.host</name>
  <description>Hostname of the (optional) proxy server for S3 connections.</description>
</property>

<property>
  <name>fs.s3a.proxy.port</name>
  <description>Proxy server port. If this property is not set
    but fs.s3a.proxy.host is, port 80 or 443 is assumed (consistent with
    the value of fs.s3a.connection.ssl.enabled).</description>
</property>

These can be set in spark-defaults.conf with the spark.hadoop prefix:

spark.hadoop.fs.s3a.proxy.host=myproxy
spark.hadoop.fs.s3a.proxy.port=8080
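
To answer the follow-up question about setting it in code: these are ordinary Hadoop configuration keys, so they can also be set on the SparkContext's hadoopConfiguration before the StreamingContext is created. A minimal sketch in Scala, reusing the placeholder proxy host and port from the lines above (the app name and S3 path are also placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("SampleApp")
val sc = new SparkContext(conf)

// S3A reads its proxy settings from these fs.s3a.proxy.* keys in the
// Hadoop configuration; as the failed attempt above suggests, the JVM's
// https.proxyHost/https.proxyPort options are not a reliable way to
// configure it. "myproxy" and 8080 are placeholder values.
sc.hadoopConfiguration.set("fs.s3a.proxy.host", "myproxy")
sc.hadoopConfiguration.set("fs.s3a.proxy.port", "8080")

val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.textFileStream("s3a://foldername/subfolder/")
lines.print()

ssc.start()
ssc.awaitTermination()

The same keys can also be passed on the command line without touching the code, e.g. spark-submit --conf spark.hadoop.fs.s3a.proxy.host=myproxy --conf spark.hadoop.fs.s3a.proxy.port=8080.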
