Load JSON from S3 inside AWS Glue PySpark job
I am trying to retrieve a JSON file from an S3 bucket inside an AWS Glue PySpark script.

I am running this function in the Glue job:
def run(spark):
    s3_bucket_path = 's3://bucket/data/file.gz'
    df = spark.read.json(s3_bucket_path)
    df.show()
After this I am getting:

AnalysisException: u'Path does not exist: s3://bucket/data/file.gz;'
I searched for this issue and did not find anything similar enough to infer where the problem is. I thought there might be permission issues accessing the bucket, but then the error message would be different.
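Before suspecting permissions, it can help to confirm that the bucket and key are what you think they are. A minimal debugging sketch (the path is the one from the question; the `head_object` call in the comment is just one way to probe, assuming the Glue role is allowed to make it):

```python
from urllib.parse import urlparse

s3_bucket_path = 's3://bucket/data/file.gz'
parsed = urlparse(s3_bucket_path)
bucket = parsed.netloc           # the bucket name
key = parsed.path.lstrip('/')    # the object key within the bucket

# With boto3 you could then verify the object actually exists, e.g.:
#   import boto3
#   boto3.client('s3').head_object(Bucket=bucket, Key=key)
# head_object raises a ClientError (404) if the key is missing.
print(bucket, key)
```

If `head_object` fails with a 404, the path in the script simply does not match any object, which would explain the AnalysisException.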
Here you can try this:
import json
import boto3

s3 = boto3.client("s3", region_name="us-west-2",
                  aws_access_key_id="", aws_secret_access_key="")
jsonFile = s3.get_object(Bucket=bucket, Key=key)
jsonObject = json.load(jsonFile["Body"])
where Key = full path to your file in the bucket,

and use this jsonObject in spark.read.json(jsonObject)
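Note that spark.read.json expects a path or an RDD of JSON strings rather than a plain Python dict, so once get_object has returned the parsed object, one workaround is to re-serialize it and parallelize. A small sketch of the non-Spark part, using io.BytesIO to stand in for the streaming "Body" object that get_object returns (the payload is made-up example data):

```python
import io
import json

# get_object returns a dict whose "Body" is a streaming, file-like object;
# io.BytesIO stands in for it here with a made-up JSON payload.
response = {"Body": io.BytesIO(b'{"name": "alice", "age": 30}')}

# json.load reads directly from the file-like body
jsonObject = json.load(response["Body"])

# spark.read.json does not accept a dict; one way to hand the parsed
# object to Spark is to serialize it back and parallelize, e.g.:
#   df = spark.read.json(sc.parallelize([json.dumps(jsonObject)]))
print(jsonObject)
```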