
Merging dataframes where a column is a variable struct - Pyspark

I have a bunch of dataframes that I need to merge. They all have the same 4 columns, but one of those columns (params) is a struct whose fields vary depending on the dataframe. Examples are shown below:

+---------+-----------+--------------------------------------------------------------------------------------------------------------------------+-------------------+
|attribute|operation  |params                                                                                                                    |timestamp          |
+---------+-----------+--------------------------------------------------------------------------------------------------------------------------+-------------------+
|profile  |CREATE_CARD|[50d966f2-2820-441a-afbe-851e45eeb13e, 1s9miu7t6an50fplvvhybow6edx9_STG, 993270335, CREATED_CARD, 8236961209881953, kobo] |2020-02-24 03:07:04|
+---------+-----------+--------------------------------------------------------------------------------------------------------------------------+-------------------+

+---------+---------+--------------------------------------------------------------------------------------------------+-------------------+
|attribute|operation|params                                                                                            |timestamp          |
+---------+---------+--------------------------------------------------------------------------------------------------+-------------------+
|profile  |UPDATE   |[0792b8d1-7ad9-43fc-9e75-9b1f2612834c, rkm9a7mescuwp0s4i01zlwi2ftu9_STG, 993270329, primary_email]|2020-02-12 18:13:08|
+---------+---------+--------------------------------------------------------------------------------------------------+-------------------+

+---------+---------+-----------------------------------------------------------------------------------+-------------------+
|attribute|operation|params                                                                             |timestamp          |
+---------+---------+-----------------------------------------------------------------------------------+-------------------+
|member   |CREATE   |[ea8e7e39-4a0a-4d41-b47e-70c8e56a2bca, h4m015wf1qxwrogj6d9l2uc5bsa9_STG, 993270331]|2020-01-02 09:51:32|
+---------+---------+-----------------------------------------------------------------------------------+-------------------+

How do I get all the rows from these dataframes into a single dataframe without adding null values for the missing fields? I have to merge the dataframes so that the final dataframe can be stored sorted on the timestamp field. I don't want to save params as a string: the final merged dataframe has to be stored as JSON text, and saving params as a string would add escaped characters to the final file, which I'm trying to avoid.
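For context, a minimal sketch of the escaping problem, assuming df1 is the first dataframe shown above (the name and output path are hypothetical):

from pyspark.sql import functions as F

# If params is converted to a string, write.json() emits it as one
# escaped blob, e.g. {"params":"{\"uuid\":\"50d966f2-...\"}"} rather
# than a nested JSON object -- the escaping described above.
df1_str = df1.withColumn("params", F.to_json("params"))
df1_str.write.json("/tmp/escaped_output")  # hypothetical path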

I tried converting the dataframes to JSON objects using toJSON() and then merging them, but toJSON() gave me an RDD with elements of string type, which I can't sort on. I also tried union, but that didn't work because the params column is a different struct in each of the dataframes shown above. What is the most efficient way to do this?
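A minimal sketch of those two attempts, assuming df1, df2 and df3 are the dataframes shown above (hypothetical names):

merged_rdd = df1.toJSON().union(df2.toJSON()).union(df3.toJSON())
# -> RDD of JSON strings: the timestamp is buried inside each string,
#    so there is no column left to sort on

merged_df = df1.union(df2)
# -> fails with AnalysisException, because union() requires compatible
#    column types and the params struct differs between the dataframes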

The final output should look like this:

+---------+-----------+-------------------+-------------------------------------------------------------------------------------------------------------------------+
|attribute|operation  |timestamp          |params                                                                                                                   |
+---------+-----------+-------------------+-------------------------------------------------------------------------------------------------------------------------+
|profile  |CREATE_CARD|2020-02-24 03:07:04|[50d966f2-2820-441a-afbe-851e45eeb13e, 1s9miu7t6an50fplvvhybow6edx9_STG, 993270335, CREATED_CARD, 8236961209881953, kobo]|
|profile  |UPDATE     |2020-02-12 18:13:08|[0792b8d1-7ad9-43fc-9e75-9b1f2612834c, rkm9a7mescuwp0s4i01zlwi2ftu9_STG, 993270329, primary_email]                       |
|member   |CREATE     |2020-01-02 09:51:32|[ea8e7e39-4a0a-4d41-b47e-70c8e56a2bca, h4m015wf1qxwrogj6d9l2uc5bsa9_STG, 993270331]                                      |
+---------+-----------+-------------------+-------------------------------------------------------------------------------------------------------------------------+

You can use "unionByName" (available from Spark 2.3): https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/dataframe.html

like:

from functools import reduce
from pyspark.sql import DataFrame

# fold the list of dataframes into one, matching columns by name
dfs = [df1, df2, df3]
df = reduce(DataFrame.unionByName, dfs)
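Assuming the union succeeds, sorting on timestamp and storing the result as JSON text (as the question requires) might look like this; the output path is hypothetical:

df.orderBy("timestamp", ascending=False).write.json("/tmp/merged_events")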

If you have a Spark version lower than 2.3 you can use "union", but beware of the column order.
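A minimal sketch of that fallback, reordering the columns of df2 and df3 to match df1 before the positional union (df1, df2, df3 as above):

# union() matches columns by position, not by name, so align the
# column order of every dataframe first
cols = df1.columns
df = df1.union(df2.select(cols)).union(df3.select(cols))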
