Split a PySpark DataFrame into multiple JSON files based on the values of a particular column?
I have JSON in the following format:
{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
{"year":"2020", "id":"4", "fruit": "Apple","cost": "400" }
{"year":"2020", "id":"5", "fruit": "Mango", "cost": "500"}
{"year":"2020", "id":"6", "fruit": "Kiwi", "cost": "600"}
Its type is pyspark.sql.dataframe.DataFrame.
How can I use PySpark to split this JSON into multiple JSON files and save them in a directory named after the year? For example:
Directory: path.../2020/<all split json files>
Apple.json
{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"4", "fruit": "Apple","cost": "400" }
Kiwi.json
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"6", "fruit": "Kiwi", "cost": "600"}
Mango.json
{"year":"2020", "id":"5", "fruit": "Mango", "cost": "500"}
Cherry.json
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
Also, if I encounter a different year, how can I write the files in the same way, for example to path.../2021/<all split json files>?
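Initially I tried finding all the unique fruits and building a list from them. Then I tried creating multiple DataFrames, pushing the matching JSON records into each one, and converting each DataFrame to JSON. I found that inefficient. I also checked this link, but the problem there is that it creates key-value pairs as a dict, which is slightly different.
I then learned about the PySpark groupBy method. It seems to make sense, since I could groupBy() the fruit value and then split the JSON file, but I feel I am missing something.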
Take the following JSON as an example:
{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
{"year":"2021", "id":"10", "fruit": "Pear","cost": "1000" }
{"year":"2021", "id":"11", "fruit": "Mango", "cost": "1100"}
{"year":"2021", "id":"12", "fruit": "Banana", "cost": "1200"}
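You can use partitionBy to partition the data by year and fruit. Note that I created copies of the year and fruit columns, because the columns you partition on are removed from the records when the data is written to disk; their values end up in the directory names instead.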
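from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.json("./ex.json")
# Copy the columns under names that differ by more than case: by default Spark resolves
# column names case-insensitively, so withColumn("Year", ...) would replace "year".
df = df.withColumn("year_part", df["year"])
df = df.withColumn("fruit_part", df["fruit"])
df.write.partitionBy("year_part", "fruit_part").json("result")
This produces a folder named result with the following structure.
|-- result
| |-- year_part=2020
| | |-- fruit_part=Apple
| | | |-- part0000-dcea0683...json
| | |-- fruit_part=Cherry
| | | |-- part0000-dcea0683...json
| | |-- fruit_part=Kiwi
| | | |-- part0000-dcea0683...json
| |-- year_part=2021
| | |-- fruit_part=Banana
| | | |-- part0000-dcea0683...json
| | |-- fruit_part=Mango
| | | |-- part0000-dcea0683...json
| | |-- fruit_part=Pear
| | | |-- part0000-dcea0683...json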
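The layout above uses key=value directory names and Spark-generated part file names, not the path.../2020/Apple.json layout the question asks for. As far as I know, the JSON writer has no option for choosing output file names, so one way to get there is a post-processing step. Below is a minimal sketch, assuming the output sits on a local filesystem (not HDFS or object storage) and that partition values are plain names. The function flatten_partitions and its parameters are hypothetical, not part of PySpark; year_key and fruit_key must match the two column names passed to partitionBy.

```python
import glob
import os
import shutil


def flatten_partitions(src_root, dst_root, year_key, fruit_key):
    """Merge each <year_key>=Y/<fruit_key>=F/*.json group into dst_root/Y/F.json.

    Hypothetical helper: assumes a local filesystem and a two-level
    partition layout as written by DataFrameWriter.partitionBy.
    """
    pattern = os.path.join(src_root, f"{year_key}=*", f"{fruit_key}=*")
    written = []
    for part_dir in sorted(glob.glob(pattern)):
        # Directory names have the form "<key>=<value>"; keep only the value.
        year = os.path.basename(os.path.dirname(part_dir)).split("=", 1)[1]
        fruit = os.path.basename(part_dir).split("=", 1)[1]

        target_dir = os.path.join(dst_root, year)
        os.makedirs(target_dir, exist_ok=True)
        target = os.path.join(target_dir, f"{fruit}.json")

        with open(target, "wb") as out:
            # A partition may hold several part files; append them in name order.
            # The "*" pattern skips hidden files such as checksum files.
            for part in sorted(glob.glob(os.path.join(part_dir, "*.json"))):
                with open(part, "rb") as src:
                    shutil.copyfileobj(src, out)
        written.append(target)
    return written
```

Calling it after the write, with the output folder, a destination folder, and the two partition column names, would produce one file per fruit under one directory per year. Since each record is a single JSON line, concatenating part files keeps the result valid JSON Lines.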