

Split Pyspark dataframe into multiple json files based on a particular column data?

I have JSON of the following format:

{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
{"year":"2020", "id":"4", "fruit": "Apple","cost": "400" }
{"year":"2020", "id":"5", "fruit": "Mango", "cost": "500"}
{"year":"2020", "id":"6", "fruit": "Kiwi", "cost": "600"}

It's of type: pyspark.sql.dataframe.DataFrame

How can I split this JSON file into multiple JSON files and save them in a year directory using Pyspark? Like:

directory: path.../2020/<all split json files>

Apple.json

{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"4", "fruit": "Apple","cost": "400" }

Kiwi.json

{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"6", "fruit": "Kiwi", "cost": "600"}

Mango.json

{"year":"2020", "id":"5", "fruit": "Mango", "cost": "500"}

Cherry.json

{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}

Also, if I encounter a different year, how do I push the files in a similar way, e.g. path.../2021/<all split json files>?

Initially I tried finding all the unique fruits and creating a list, then creating multiple data frames and pushing the JSON values into them, and finally converting every dataframe to JSON format. But I find this inefficient. I also checked this link, but the issue there is that it creates key-value pairs in dict form, which is slightly different.
Then I also learned about the Pyspark groupBy method. It seems to make sense because I could groupBy() the fruit values and then split the JSON file, but I feel I am missing something.
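
For reference, a minimal sketch of that per-fruit loop (the input file and output paths are assumptions based on the sample data above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("fruits.json")  # assumed input file

# Collect the distinct fruit values, then run one filter + write per fruit.
# Each iteration launches a separate Spark job, which is why this feels inefficient.
fruits = [row["fruit"] for row in df.select("fruit").distinct().collect()]
for fruit in fruits:
    df.filter(df["fruit"] == fruit).write.mode("overwrite").json(f"output/2020/{fruit}")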

Using the following JSON as an example:

{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
{"year":"2021", "id":"10", "fruit": "Pear","cost": "1000" }
{"year":"2021", "id":"11", "fruit": "Mango", "cost": "1100"}
{"year":"2021", "id":"12", "fruit": "Banana", "cost": "1200"}

You can use partitionBy to partition the data by year and fruit. Note that I created duplicates of the year and fruit columns, because the columns you partition on are dropped from the records when the data is written to disk.

# Duplicate the partition columns so year/fruit also stay inside the JSON records
df = spark.read.json("./ex.json")
df = df.withColumn("Year", df["year"])
df = df.withColumn("Fruit", df["fruit"])
# Write one sub-directory per (Year, Fruit) combination
df.write.partitionBy("Year", "Fruit").json("result")

This results in a folder called RESULT with the following structure:

|-- RESULT
|   |-- Year=2020
|   |   |-- Fruit=Apple
|   |   |   |-- part0000-dcea0683...json
|   |   |-- Fruit=Cherry
|   |   |   |-- part0000-dcea0683...json
|   |   |-- Fruit=Kiwi
|   |   |   |-- part0000-dcea0683...json
|   |-- Year=2021
|   |   |-- Fruit=Banana
|   |   |   |-- part0000-dcea0683...json
|   |   |-- Fruit=Mango
|   |   |   |-- part0000-dcea0683...json
|   |   |-- Fruit=Pear
|   |   |   |-- part0000-dcea0683...json
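
If you want each fruit's data in a single file per directory (closer to the Apple.json layout in the question), one option is to repartition on the same columns before writing; this is a sketch of an optional extra step, not required for the partitioning itself:

# Group all rows with the same (Year, Fruit) into one Spark partition first,
# so each output directory contains a single part-*.json file
df.repartition("Year", "Fruit").write.partitionBy("Year", "Fruit").mode("overwrite").json("result")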
