Load JSON from S3 inside AWS Glue PySpark job
I am trying to retrieve a JSON file from an S3 bucket inside an AWS Glue PySpark script.

I am running this function in the Glue job:
def run(spark):
    s3_bucket_path = 's3://bucket/data/file.gz'
    df = spark.read.json(s3_bucket_path)
    df.show()
After this I am getting:

AnalysisException: u'Path does not exist: s3://bucket/data/file.gz;'
I searched for this issue and did not find anything similar enough to infer where the problem is. I thought there might be permission issues accessing the bucket, but then the error message would be different.
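Before suspecting permissions, it can help to confirm that the bucket and key are what you think they are. A minimal debugging sketch (the path is the one from the question; the `head_object` call in the comment is just one way to probe, assuming the Glue role is allowed to make it):

```python
from urllib.parse import urlparse

s3_bucket_path = 's3://bucket/data/file.gz'
parsed = urlparse(s3_bucket_path)
bucket = parsed.netloc           # the bucket name
key = parsed.path.lstrip('/')    # the object key within the bucket

# With boto3 you could then verify the object actually exists, e.g.:
#   import boto3
#   boto3.client('s3').head_object(Bucket=bucket, Key=key)
# head_object raises a ClientError (404) if the key is missing.
print(bucket, key)
```

If `head_object` fails with a 404, the path in the script simply does not match any object, which would explain the AnalysisException.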
Here you can try this:
import json
import boto3

s3 = boto3.client("s3", region_name="us-west-2",
                  aws_access_key_id="", aws_secret_access_key="")
jsonFile = s3.get_object(Bucket=bucket, Key=key)
jsonObject = json.load(jsonFile["Body"])
where Key = full path to your file in the bucket,

and use this jsonObject in spark.read.json(jsonObject)
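Note that spark.read.json expects a path or an RDD of JSON strings rather than a plain Python dict, so once get_object has returned the parsed object, one workaround is to re-serialize it and parallelize. A small sketch of the non-Spark part, using io.BytesIO to stand in for the streaming "Body" object that get_object returns (the payload is made-up example data):

```python
import io
import json

# get_object returns a dict whose "Body" is a streaming, file-like object;
# io.BytesIO stands in for it here with a made-up JSON payload.
response = {"Body": io.BytesIO(b'{"name": "alice", "age": 30}')}

# json.load reads directly from the file-like body
jsonObject = json.load(response["Body"])

# spark.read.json does not accept a dict; one way to hand the parsed
# object to Spark is to serialize it back and parallelize, e.g.:
#   df = spark.read.json(sc.parallelize([json.dumps(jsonObject)]))
print(jsonObject)
```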