
Split Pyspark dataframe into multiple json files based on a particular column data?

I have json data in the following format:

{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
{"year":"2020", "id":"4", "fruit": "Apple","cost": "400" }
{"year":"2020", "id":"5", "fruit": "Mango", "cost": "500"}
{"year":"2020", "id":"6", "fruit": "Kiwi", "cost": "600"}

Its type: pyspark.sql.dataframe.DataFrame

How can I split this json file into multiple json files, one per fruit, and save them in a year directory using Pyspark? Like:

Directory: path.../2020/<all split json files>

Apple.json

{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"4", "fruit": "Apple","cost": "400" }

Kiwi.json

{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"6", "fruit": "Kiwi", "cost": "600"}

Mango.json

{"year":"2020", "id":"5", "fruit": "Mango", "cost": "500"}

Cherry.json

{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}

Also, if I encounter a different year, how can I push the files in a similar way, for example: path.../2021/<all split json files>

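Initially I tried finding all the unique fruits and creating a list. Then I tried creating multiple dataframes, pushing the json values into them, and converting each dataframe to json format. But I found this inefficient. I also checked a related link, but the problem there is that it creates key-value pairs in dict form, which is slightly different.
Then I learned about the Pyspark groupBy method. It seems to make sense, since I could groupBy() the fruit value and then split the json file, but I feel I am missing something.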

Take the following JSON as an example:

{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
{"year":"2021", "id":"10", "fruit": "Pear","cost": "1000" }
{"year":"2021", "id":"11", "fruit": "Mango", "cost": "1100"}
{"year":"2021", "id":"12", "fruit": "Banana", "cost": "1200"}

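You can partition the data by year and fruit using partitionBy. Note that I create copies of the year and fruit columns, because the columns used for partitioning are removed from the records when the data is written to disk.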

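from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("./ex.json")
# Partition columns become directory names and are dropped from the records
df = df.withColumn("Year", df["year"])
df = df.withColumn("Fruit", df["fruit"])
df.write.partitionBy("Year", "Fruit").json("result")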
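
One caveat about the copies: Spark resolves column names case-insensitively by default, so a column named Year can replace year instead of sitting alongside it, and the written records then lose both fields. A minimal sketch of a variant that avoids this, assuming copy names that differ by more than letter case. The names write_by_year_and_fruit, year_key and fruit_key are hypothetical, chosen only for illustration:

```python
def write_by_year_and_fruit(df, out_dir):
    """Write df as JSON partitioned by year and fruit, keeping both fields in each record.

    df is expected to be a pyspark.sql.DataFrame with "year" and "fruit" columns.
    """
    # Hypothetical copy names: they differ from the originals by more than
    # letter case, so they are added as new columns rather than replacing them.
    with_keys = (
        df.withColumn("year_key", df["year"])
          .withColumn("fruit_key", df["fruit"])
    )
    # partitionBy turns the key columns into directory names
    # (year_key=<value>/fruit_key=<value>) and drops only those columns,
    # so "year" and "fruit" remain inside the JSON records.
    with_keys.write.partitionBy("year_key", "fruit_key").json(out_dir)
```

It could be called as write_by_year_and_fruit(spark.read.json("./ex.json"), "result"). The directories would then be named year_key=2020/fruit_key=Apple and so on, rather than Year=2020/Fruit=Apple.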

Writing with partitionBy("Year", "Fruit") produces a folder named RESULT with the following structure.

|-- RESULT
|   |-- Year=2020
|   |   |-- Fruit=Apple
|   |   |   |-- part0000-dcea0683...json
|   |   |-- Fruit=Cherry
|   |   |   |-- part0000-dcea0683...json
|   |   |-- Fruit=Kiwi
|   |   |   |-- part0000-dcea0683...json
|   |-- Year=2021
|   |   |-- Fruit=Banana
|   |   |   |-- part0000-dcea0683...json
|   |   |-- Fruit=Mango
|   |   |   |-- part0000-dcea0683...json
|   |   |-- Fruit=Pear
|   |   |   |-- part0000-dcea0683...json
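
Spark chooses the part file names itself, so this layout is not quite the path.../2020/Apple.json layout asked for in the question. One possible follow-up step, sketched below, copies each partition into a file named after the fruit inside a directory named after the year. It assumes the output sits on a local filesystem (not HDFS or object storage) and that the directories are named Year=<value>/Fruit=<value> as in the tree above. The function name flatten_partitions and the target directory are hypothetical:

```python
import shutil
from pathlib import Path


def flatten_partitions(result_dir, target_dir):
    """Copy Year=<y>/Fruit=<f>/part*.json into <target_dir>/<y>/<f>.json."""
    result_dir, target_dir = Path(result_dir), Path(target_dir)
    written = []
    for fruit_dir in sorted(result_dir.glob("Year=*/Fruit=*")):
        # Directory names have the form "<column>=<value>"; keep the value.
        year = fruit_dir.parent.name.split("=", 1)[1]
        fruit = fruit_dir.name.split("=", 1)[1]
        out_path = target_dir / year / f"{fruit}.json"
        out_path.parent.mkdir(parents=True, exist_ok=True)
        # A partition may hold several part files; concatenate them in
        # name order (each one contains newline-delimited JSON records).
        with out_path.open("wb") as out:
            for part in sorted(fruit_dir.glob("part*.json")):
                with part.open("rb") as src:
                    shutil.copyfileobj(src, out)
        written.append(out_path)
    return written
```

For example, flatten_partitions("result", "output") would create output/2020/Apple.json, output/2021/Pear.json and so on, leaving the original Spark output untouched.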

