Merging dataframes where a column is a variable struct - PySpark
I have a number of dataframes that need to be merged. They share the same 4 columns, but one of them (params) has variable fields depending on the dataframe. Examples are shown below:
+---------+-----------+--------------------------------------------------------------------------------------------------------------------------+-------------------+
|attribute|operation |params |timestamp |
+---------+-----------+--------------------------------------------------------------------------------------------------------------------------+-------------------+
|profile |CREATE_CARD|[50d966f2-2820-441a-afbe-851e45eeb13e, 1s9miu7t6an50fplvvhybow6edx9_STG, 993270335, CREATED_CARD, 8236961209881953, kobo] |2020-02-24 03:07:04|
+---------+-----------+--------------------------------------------------------------------------------------------------------------------------+-------------------+
+---------+---------+--------------------------------------------------------------------------------------------------+-------------------+
|attribute|operation|params |timestamp |
+---------+---------+--------------------------------------------------------------------------------------------------+-------------------+
|profile |UPDATE |[0792b8d1-7ad9-43fc-9e75-9b1f2612834c, rkm9a7mescuwp0s4i01zlwi2ftu9_STG, 993270329, primary_email]|2020-02-12 18:13:08|
+---------+---------+--------------------------------------------------------------------------------------------------+-------------------+
+---------+---------+-----------------------------------------------------------------------------------+-------------------+
|attribute|operation|params |timestamp |
+---------+---------+-----------------------------------------------------------------------------------+-------------------+
|member |CREATE |[ea8e7e39-4a0a-4d41-b47e-70c8e56a2bca, h4m015wf1qxwrogj6d9l2uc5bsa9_STG, 993270331]|2020-01-02 09:51:32|
+---------+---------+-----------------------------------------------------------------------------------+-------------------+
How can I combine all the rows from these dataframes into a single dataframe without adding null values for the missing fields? I have to merge the dataframes so that I can store the final dataframe sorted by the timestamp column. I don't want to keep params as a string, because I need to store the final merged dataframe as JSON text, and saving params as a string adds escape characters to the final file, which I am trying to avoid.
I tried converting the dataframes to JSON objects with toJSON() and then merging them, but toJSON() gives me an RDD of string-typed elements that I cannot sort. I also tried union, but that does not work because the column 'params' is a different struct in each of the dataframes shown above. What is the most efficient way to do this?
The final output should look like this:
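The escape-character problem described above can be reproduced with plain JSON serialization: a field that is already a JSON string gets double-encoded when the record is serialized again (a minimal illustration with made-up field names, independent of Spark):

```python
import json

params = {"id": "50d966f2", "action": "CREATED_CARD"}

# params kept as a structured object: nested JSON, no escaping.
as_struct = json.dumps({"params": params})

# params pre-serialized to a string: the inner quotes get escaped
# with backslashes when the outer record is serialized.
as_string = json.dumps({"params": json.dumps(params)})

print(as_struct)
print(as_string)
```

This is why `df.write.json(...)` on a struct column produces clean nested JSON, while writing a stringified params column produces escaped quotes.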
+---------+-----------+-------------------+--------------------------------------------------------------------------------------------------------------------------+
|attribute|operation  |timestamp          |params                                                                                                                    |
+---------+-----------+-------------------+--------------------------------------------------------------------------------------------------------------------------+
|profile  |CREATE_CARD|2020-02-24 03:07:04|[50d966f2-2820-441a-afbe-851e45eeb13e, 1s9miu7t6an50fplvvhybow6edx9_STG, 993270335, CREATED_CARD, 8236961209881953, kobo] |
|profile  |UPDATE     |2020-02-12 18:13:08|[0792b8d1-7ad9-43fc-9e75-9b1f2612834c, rkm9a7mescuwp0s4i01zlwi2ftu9_STG, 993270329, primary_email]                        |
|member   |CREATE     |2020-01-02 09:51:32|[ea8e7e39-4a0a-4d41-b47e-70c8e56a2bca, h4m015wf1qxwrogj6d9l2uc5bsa9_STG, 993270331]                                       |
+---------+-----------+-------------------+--------------------------------------------------------------------------------------------------------------------------+
You can use unionByName (available since Spark 2.3): https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/dataframe.html
For example:
from functools import reduce
from pyspark.sql import DataFrame
dfs = [df1, df2, df3]
df = reduce(DataFrame.unionByName, dfs)
If your Spark version is lower than 2.3, you can use union instead, but be careful about the column order.
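For reference, the reduce call above folds the two-argument unionByName pairwise over the list, i.e. it is equivalent to df1.unionByName(df2).unionByName(df3). The same folding pattern, sketched in plain Python with lists of row dicts standing in for dataframes (the names here are illustrative, not Spark API):

```python
from functools import reduce

# Stand-ins for the three dataframes: lists of row dicts with shared column names.
df1 = [{"attribute": "profile", "operation": "CREATE_CARD", "timestamp": "2020-02-24 03:07:04"}]
df2 = [{"attribute": "profile", "operation": "UPDATE", "timestamp": "2020-02-12 18:13:08"}]
df3 = [{"attribute": "member", "operation": "CREATE", "timestamp": "2020-01-02 09:51:32"}]

# reduce(f, [a, b, c]) computes f(f(a, b), c) -- the same shape as
# reduce(DataFrame.unionByName, dfs) in the answer above.
merged = reduce(lambda left, right: left + right, [df1, df2, df3])

# Mirrors sorting the merged dataframe with df.orderBy("timestamp").
merged.sort(key=lambda row: row["timestamp"])
```

After the fold, a single orderBy("timestamp") on the merged dataframe gives the ordering the question asks for.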