
No FileSystem for scheme "s3" when trying to read a list of files with Spark from EC2

I'm trying to provide a list of files for Spark to read as and when it needs them (which is why I'd rather not use boto or anything else to pre-download all the files onto the instance and only then read them into Spark "locally").

os.environ['PYSPARK_SUBMIT_ARGS'] = "--master local[3] pyspark-shell"
spark = SparkSession.builder.getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3.access.key', credentials['AccessKeyId'])
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3.access.key', credentials['SecretAccessKey'])
spark.read.json(['s3://url/3521.gz', 's3://url/2734.gz'])

No idea what local[3] is about, but without this --master flag I was getting another exception:

Exception: Java gateway process exited before sending the driver its port number.

Now, I'm getting this:

Py4JJavaError: An error occurred while calling o37.json.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
...

Not sure what o37.json refers to here but it probably doesn't matter.

I saw a bunch of answers to similar questions suggesting adding flags like:

os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell"

I tried both prepending and appending it to the other flag, but it doesn't work.
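For reference, the combined form I tried looks roughly like this (a minimal sketch; the versions are just the ones suggested in those answers, and the whole value has to end with pyspark-shell and be set before the session is created):

import os
from pyspark.sql import SparkSession

# Both flags go into one string, ending with pyspark-shell, set before any Spark JVM exists.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--master local[3] "
    "--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 "
    "pyspark-shell"
)

spark = SparkSession.builder.getOrCreate()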

The same goes for the many variations I've seen in other answers and elsewhere on the internet (with different packages and versions), for example:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] --jars spark-snowflake_2.12-2.8.4-spark_3.0.jar,postgresql-42.2.19.jar,mysql-connector-java-8.0.23.jar,hadoop-aws-3.2.2,aws-java-sdk-bundle-1.11.563.jar'
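(One thing that seems to matter across all these variations is that the hadoop-aws version should match the Hadoop build bundled with Spark. A minimal sketch, assuming a session can be created at all, to check which version that is:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# The Hadoop version bundled with this Spark build, via the JVM gateway;
# hadoop-aws (and the matching aws-java-sdk) should line up with this version.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())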

A typical example of reading files from S3 is shown below.

Additionally, you can go through this answer to make sure the minimal structure and necessary modules are in place: java.io.IOException: No FileSystem for scheme: s3

Read Parquet - S3

os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=com.amazonaws:aws-java-sdk-bundle:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0 pyspark-shell"


sc = SparkContext.getOrCreate()
sql = SQLContext(sc)

hadoop_conf = sc._jsc.hadoopConfiguration()

config = configparser.ConfigParser()

config.read(os.path.expanduser("~/.aws/credentials"))

access_key = config.get("****", "aws_access_key_id")
secret_key = config.get("****", "aws_secret_access_key")
session_key = config.get("****", "aws_session_token")


hadoop_conf.set("fs.s3.aws.credentials.provider", "org.apache.hadoop.fs.s3.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
hadoop_conf.set("fs.s3a.session.token", session_key)

s3_path = "s3a://xxxx/yyyy/zzzz/"

sparkDF = sql.read.parquet(s3_path) 
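With the same configuration in place, the original use case from the question should also work by switching the paths to the s3a scheme (a sketch reusing the sql context from above and the example paths from the question; the json reader accepts a list of paths and reads .gz files transparently):

# Same credentials and s3a configuration as above; only the scheme changes from s3:// to s3a://.
json_df = sql.read.json(['s3a://url/3521.gz', 's3a://url/2734.gz'])
json_df.printSchema()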
