
pyspark: how to group N records in a spark dataframe

I have a CSV with 5 million records, with the structure:

+----------+------------+------------+
|  row_id  |    col1    |    col2    |
+----------+------------+------------+
|         1|   value    |    value   |
|         2|   value    |    value   |
|....                                |
|...                                 |
|   5000000|   value    |    value   |
+----------+------------+------------+

I need to convert this CSV to JSON with each json-file having 500 records and a particular structure like this:

{
    "entry": [
        {
            "row_id": "1",
            "col1": "value",
            "col2": "value"
        },
        {
            "row_id": "2",
            "col1": "value",
            "col2": "value"
        },
        ....
        ..
        {
            "row_id": "500",
            "col1": "value",
            "col2": "value"
        }
    ],
    "last_updated":"09-09-2021T01:03:04.44Z"
}

Using PySpark I am able to read the CSV and create a dataframe. I don't know how to group 500 records into a single JSON with the structure "entry": [ <500 records> ], "last_updated": "09-09-2021T01:03:04.44Z".
I can use df.coalesce(1).write.option("maxRecordsPerFile",500), but that only gives me the sets of 500 records, without any structure. I want those 500 records in the "entry" list, with "last_updated" following it (which I am taking from datetime.now()).
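
For context, this is roughly how I read the CSV into the dataframe (the app name, path, and options below are placeholders for my actual setup):

from pyspark.sql import SparkSession

# Placeholder setup: adjust the app name, input path, and options to the real environment
spark = SparkSession.builder.appName("csv-to-json").getOrCreate()
df = (
    spark.read
         .option("header", True)       # first CSV line holds the column names
         .option("inferSchema", True)  # infer row_id as a numeric type
         .csv("input.csv")             # hypothetical input path
)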

You may try the following:


NB. I've used the following imports.

from pyspark.sql import functions as F
from pyspark.sql import Window
from datetime import datetime  # used for last_updated in step 2

1. We need a column that can be used to split your data into batches of 500 records.

(Recommended) We can create a pseudo column to achieve this with row_number:

# rows 1-500 get group 0, rows 501-1000 get group 1, and so on
df = df.withColumn("group_num", F.floor((F.row_number().over(Window.orderBy("row_id")) - 1) / 500))

Otherwise, if row_id starts at 1 and increases consecutively across the 5 million records, we may use it directly:

df = df.withColumn("group_num", F.floor((F.col("row_id") - 1) / 500))

or, in the odd chance that the column "last_updated":"09-09-2021T01:03:04.44Z" is already unique to each batch of 500 records:

df = df.withColumn("group_num",F.col("last_updated"))

2. We will transform your dataset by grouping by group_num:

df = (
    df.groupBy("group_num")
      .agg(
          F.collect_list(
              F.expr("struct(row_id, col1, col2)")
          ).alias("entry")
      )
      .withColumn("last_updated", F.lit(datetime.now()))
      .drop("group_num")
)
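
If last_updated has to be a string in exactly the format shown in the question (an assumption on my part about the required format), you could build it with date_format instead of F.lit(datetime.now()), for example:

# Assumes the pattern "dd-MM-yyyy'T'HH:mm:ss.SS'Z'" matches "09-09-2021T01:03:04.44Z"; adjust if the date order differs
df = df.withColumn(
    "last_updated",
    F.date_format(F.current_timestamp(), "dd-MM-yyyy'T'HH:mm:ss.SS'Z'")
)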

NB. If you would like to include all columns, you may use F.expr("struct(*)") instead of F.expr("struct(row_id, col1, col2)"); note that struct(*) would also pull the group_num helper column into each entry unless it is excluded first.


3. Finally, you can write to your output/destination with the option .option("maxRecordsPerFile",1), since each row now stores at most 500 entries.

E.g.

df.write.format("json").option("maxRecordsPerFile",1).save("<your intended path here>")
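
As a quick sanity check (purely illustrative, reusing the same placeholder path), you can read the output back and confirm that each row holds an entry array of up to 500 elements plus a last_updated value:

# Each output file should contain a single row with up to 500 elements in "entry"
check = spark.read.json("<your intended path here>")
check.printSchema()
check.select(F.size("entry").alias("entries_per_row")).show(5)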

Let me know if this works for you让我知道这是否适合你
