pyspark: how to group N records in a spark dataframe
I have a CSV with 5 million records, with the structure:
+----------+-------+-------+
| row_id   | col1  | col2  |
+----------+-------+-------+
|        1 | value | value |
|        2 | value | value |
|      ... | ...   | ...   |
|  5000000 | value | value |
+----------+-------+-------+
I need to convert this CSV to JSON, with each JSON file holding 500 records in a particular structure like this:
{
  "entry": [
    {
      "row_id": "1",
      "col1": "value",
      "col2": "value"
    },
    {
      "row_id": "2",
      "col1": "value",
      "col2": "value"
    },
    ...
    {
      "row_id": "500",
      "col1": "value",
      "col2": "value"
    }
  ],
  "last_updated": "09-09-2021T01:03:04.44Z"
}
Using PySpark I am able to read the CSV and create a dataframe. What I don't know is how to group 500 records into a single JSON with the structure "entry": [ <500 records> ], "last_updated": "09-09-2021T01:03:04.44Z". I can use df.coalesce(1).write.option("maxRecordsPerFile", 500), but that gives me only the sets of 500 records, without any surrounding structure. I want those 500 records inside the "entry" list, with "last_updated" following it (which I am taking from datetime.now()).
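For context, this is roughly how I read the CSV; the path, header and schema options here are illustrative placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# read the 5M-record CSV into a dataframe; "input.csv" is a placeholder path
df = spark.read.option("header", True).csv("input.csv")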
You may try the following:
NB. I've used the following imports (datetime is needed for step 2):

from datetime import datetime

from pyspark.sql import functions as F
from pyspark.sql import Window
1. We need a column that can be used to split your data into 500-record batches; a quick sanity check on the result follows after this step.
(Recommended) We can create a pseudo column to achieve this with row_number:
df = df.withColumn("group_num",(F.row_number().over(Window.orderBy("row_id"))-1) % 500 )
otherwise, if row_id starts at 1 and increases consistently across the 5 million records, we may use it directly:
df = df.withColumn("group_num",(F.col("row_id")-1) % 500 )
or, in the odd chance that a column such as last_updated (e.g. "last_updated": "09-09-2021T01:03:04.44Z") is already unique to each batch of 500 records:
df = df.withColumn("group_num",F.col("last_updated"))
2. We will transform your dataset by grouping on group_num:
df = (
    df.groupBy("group_num")
      .agg(
          F.collect_list(
              F.expr("struct(row_id,col1,col2)")
          ).alias("entry")
      )
      .withColumn("last_updated", F.lit(datetime.now()))
      .drop("group_num")
)
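F.lit(datetime.now()) produces a timestamp column; if you need last_updated as the exact string format shown in the question ("09-09-2021T01:03:04.44Z"), a sketch using date_format (assuming the date part is dd-MM-yyyy) would be:

# format the current timestamp as e.g. "09-09-2021T01:03:04.44Z"
df = df.withColumn(
    "last_updated",
    F.date_format(F.current_timestamp(), "dd-MM-yyyy'T'HH:mm:ss.SS'Z'")
)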
NB. If you would like to include all columns you may use F.expr("struct(*)") instead of F.expr("struct(row_id,col1,col2)"). Note that struct(*) will also pull in the pseudo group_num column, so each entry would carry it too.
3. Finally, you can write to your output/destination with the option .option("maxRecordsPerFile", 1), since each row now stores at most 500 entries.
Eg.
df.write.format("json").option("maxRecordsPerFile",1).save("<your intended path here>")
Let me know if this works for you.