
Getting objects from S3 bucket using PySpark

I'm trying to get JSON objects from an S3 bucket using PySpark (on Windows, via a WSL2 terminal).

I can do this using boto3 as an intermediate step, but I get an error when I try the spark.read.json method.
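Since boto3 does work as an intermediate step, here is a minimal sketch of what that fallback can look like, for comparison; the bucket name, key and region below are placeholders rather than values from the original post. It fetches one object with boto3 on the driver and hands the JSON text to spark.read.json via an RDD of strings.

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('s3_json_via_boto3').getOrCreate()

# Fetch the object with boto3 (credentials come from the usual AWS credential chain).
s3 = boto3.client('s3', region_name='eu-west-1')
obj = s3.get_object(Bucket='my-bucket', Key='my-directory/my-object.json')
body = obj['Body'].read().decode('utf-8')

# spark.read.json also accepts an RDD of JSON strings, so the downloaded text
# can be parsed into a DataFrame without writing it to disk first.
s3_df = spark.read.json(spark.sparkContext.parallelize([body]))
s3_df.printSchema()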

Code:

import findspark
findspark.init()  # locate the local Spark installation before pyspark is used

from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
import os
import multiprocessing

#----------------APACHE CONFIGURATIONS--------------
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

#---------------spark--------------
conf = (
    SparkConf()
    .set('spark.executor.extraJavaOptions','-Dcom.amazonaws.services.s3.enableV4=true')
    .set('spark.driver.extraJavaOptions','-Dcom.amazonaws.services.s3.enableV4=true')
    .setAppName('pyspark_aws')
    .setMaster(f"local[{multiprocessing.cpu_count()}]")
    .setIfMissing("spark.executor.memory", "2g")
)

sc=SparkContext(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
spark=SparkSession(sc)
#--------------hadoop--------------
accessKeyId='xxxxxxxxxxxx'
secretAccessKey='xxxxxxxxx'

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', accessKeyId)
hadoopConf.set('fs.s3a.secret.key', secretAccessKey)
hadoopConf.set('fs.s3a.endpoint', 's3-eu-west-1.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
hadoopConf.set('fs.s3a.multipart.size', '419430400')
hadoopConf.set('fs.s3a.multipart.threshold', '2097152000')
hadoopConf.set('fs.s3a.connection.maximum', '500')
hadoopConf.set('fs.s3a.connection.timeout', '600000')


s3_df = spark.read.json('s3a://{bucket}/{directory}/{object}.json')

Error:

py4j.protocol.Py4JJavaError: An error occurred while calling ...
: java.lang.NumberFormatException: For input string: "32M"
        at java.base/java.lang.NumberFormatException.forInputString(...)
        at java.base/java.lang.Long.parseLong(...)
        at java.base/java.lang.Long.parseLong(...)
        at org.apache.hadoop.conf.Configuration.getLong(...)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getDefaultBlockSize(...)
        at org.apache.hadoop.fs.FileSystem.getDefaultBlockSize(...)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(...)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(...)
        at org.apache.hadoop.fs.FileSystem.exists(...)
        at org.apache.spark.sql.execution.datasources.DataSource...
        at org.apache.spark.sql.execution.datasources.DataSource...
        at org.apache.spark.util.ThreadUtils$.$anonfun$parmap...
        at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(...)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(...)
        at java.base/java.util.concurrent.ForkJoinPool.scan(...)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(...)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

When I ran into a similar error earlier, I added the multipart.size, multipart.threshold, connection.maximum and connection.timeout Hadoop conf settings (that earlier error said "64M" instead of "32M", and it changed once I added those settings).

I'm new to Spark, so any and all tips/pointers would be helpful!

I can add more details if needed.

Answer:

"32M" is the default value of "fs.s3a.block.size".

Try hadoopConf.set('fs.s3a.block.size', '33554432').

Go to https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html

and you will find the explanation of "32M" and "64M" there.
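As a sketch of how that fix could be wired into the setup from the question: instead of (or in addition to) calling hadoopConf.set after the context starts, the same option can be passed through SparkConf with the spark.hadoop. prefix, so the plain byte value is already in the Hadoop configuration when S3AFileSystem first reads it. The values below are simply the megabyte figures from the error messages written out as integers, not tuned recommendations.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName('pyspark_aws')
    .setMaster('local[*]')
    # spark.hadoop.* keys are copied into the Hadoop configuration, so the
    # numeric value is in place before the S3A filesystem is created.
    .set('spark.hadoop.fs.s3a.block.size', '33554432')        # 32 * 1024 * 1024
    .set('spark.hadoop.fs.s3a.multipart.size', '67108864')    # 64 * 1024 * 1024
)

sc = SparkContext(conf=conf)
spark = SparkSession(sc)

Alternatively, upgrading the aws-java-sdk / hadoop-aws pair from 1.7.4 / 2.7.3 to a newer matched release should let the S3A connector parse size suffixes like "32M" directly, as described in the linked documentation.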

