
PySpark: convert a JSON array to DataFrame rows

PySpark beginner here. I have a Spark DataFrame where each row is a URL on S3. Each URL points to a GZIP file containing a JSON array. I can parse each row (link) in the DataFrame into a Python list, but I don't know how to create multiple rows from this list of JSON objects.

This is the function I used, which returns a list of JSON objects:

import json
from gzip import GzipFile
from io import BytesIO
import boto3

def distributed_read_file(url):
    s3_client = boto3.client('s3')
    result = s3_client.get_object(Bucket=raw_data_bucket_name, Key=url)
    bytestream = BytesIO(result['Body'].read())
    string_json = GzipFile(None, 'rb', fileobj=bytestream).read().decode('utf-8')
    list_of_jsons = json.loads(string_json)
    return list_of_jsons

For example, if these are the JSON objects in the list:

[{"a": 99, "b": 102}, {"a": 43, "b": 87}]

I want to run a function on the URLs DataFrame, for example:

result_df = urls_rdd.map(distributed_read_file)

and get a DataFrame with the columns a and b (the JSON keys). When I tried this, I got each JSON object back as a MapType column, which is hard for me to work with.

Thank you very much, I hope this was clear!

In case it helps someone, I found a really simple solution:

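import json
from gzip import GzipFile
from io import BytesIO
import boto3
from pyspark.sql import Row

def distributed_read_gzip(url):
    s3_client = boto3.client('s3')
    result = s3_client.get_object(Bucket=raw_data_bucket_name, Key=url)
    bytestream = BytesIO(result['Body'].read())
    string_json = GzipFile(None, 'rb', fileobj=bytestream).read().decode('utf-8')
    # Emit one Row per JSON object, so the keys become column names
    for json_obj in json.loads(string_json):
        yield Row(**json_obj)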

The function is called with flatMap, because each URL yields several rows:

new_rdd = urls_rdd.flatMap(distributed_read_gzip)
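
To see why flatMap rather than map gives one row per JSON object, here is a minimal sketch that runs without Spark or S3. It assumes the file bodies are already in memory as gzip-compressed bytes. The names decode_gzipped_json_array and payloads are hypothetical stand-ins, and itertools.chain only imitates what flatMap does on an RDD.

```python
import gzip
import json
from itertools import chain

def decode_gzipped_json_array(payload: bytes):
    """Yield each object from a gzip-compressed JSON array (hypothetical helper)."""
    text = gzip.decompress(payload).decode("utf-8")
    for obj in json.loads(text):
        yield obj

# Two fake "files", standing in for the bodies fetched from S3.
payloads = [
    gzip.compress(json.dumps([{"a": 99, "b": 102}, {"a": 43, "b": 87}]).encode("utf-8")),
    gzip.compress(json.dumps([{"a": 1, "b": 2}]).encode("utf-8")),
]

# map-like: one element per file, each holding a whole list of objects.
mapped = [list(decode_gzipped_json_array(p)) for p in payloads]

# flatMap-like: one element per JSON object, across all files.
flattened = list(chain.from_iterable(decode_gzipped_json_array(p) for p in payloads))

print(len(mapped))     # 2 (one list per file)
print(len(flattened))  # 3 (one entry per JSON object)
```

In Spark, the RDD of Row objects returned by flatMap can then be turned into a DataFrame with spark.createDataFrame(new_rdd), assuming spark is an active SparkSession and the JSON objects share the same keys.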
