如何比较两个火花数据帧并获取所有缺失值

Question

我有 2 个数据框：

数据框A：

column1 column2 column3
   a    25  ast
   b        phone
   c    32  dlp
   d    45  
   e    60  phq

数据框 B：

column1 column2 column3
       a    25  ast
       b    15  phone
       c    32  dlp
       d    45  rare
       e    60  phq

我想比较两个数据帧并将所有缺失值写入另一个数据帧：

输出应该像数据框 C:

      column1   column2 column3
           a        
           b    15  
           c        
           d             rare
           e

缺失值可以在数据框 A 或数据框 B 中，如何使用 pyspark 数据框来实现？

Answer 1

假设您的数据集 1 被称为first ，数据集 2 被称为second 。

// we first have to rename the columns so we can distinguish which one belongs to which
first = first
  .withColumnRenamed("column2", "fir_column2")
  .withColumnRenamed("column3", "fir_column3")

second = second
  .withColumnRenamed("column2", "sec_column2")
  .withColumnRenamed("column3", "sec_column3")

然后我们收集列：

val pairs = first
  // join the tables on the unique key, in our case, `column1`
  .join(second, Seq("column1"), "left")
  // collect the pairs to array, so we can check for nulls
  .withColumn("column2_pairs", array("fir_column2", "sec_column2"))
  .withColumn("column3_pairs", array("fir_column3", "sec_column3"))
  // we then select just the arrays, for the sake of explanation
  .select("column1", "column2_pairs", "column3_pairs")

到目前为止的输出：

+-------+-------------+--------------+
|column1|column2_pairs|column3_pairs |
+-------+-------------+--------------+
|a      |[25, 25]     |[ast, ast]    |
|b      |[null, 15]   |[phone, phone]|
|c      |[32, 32]     |[dlp, dlp]    |
|d      |[45, 45]     |[null, rare]  |
|e      |[60, 60]     |[phq, phq]    |
+-------+-------------+--------------+

我们现在要做的是对列column2_pairs和column3_pairs进行操作，如果有一个为空，则选择另一个值。 这可以通过以下方式完成：

val newTable = pairs
  // in the collected array, if one of them is null, select the other value (for column2)
  .withColumn("missing_2", expr("filter(column2_pairs, x -> x is not null)"))
  .withColumn("missing_2_value", when(size(col("missing_2")).equalTo(1), col("missing_2").getItem(0)))

  // in the collected array, if one of them is null, select the other value (for column3)
  .withColumn("missing_3", expr("filter(column3_pairs, x -> x is not null)"))
  .withColumn("missing_3_value", when(size(col("missing_3")).equalTo(1), col("missing_3").getItem(0)))

  // do the final selection
  .select("column1", "missing_2_value", "missing_3_value")
  // renamings to get that to your desired format
  .withColumnRenamed("missing_2_value", "column2")
  .withColumnRenamed("missing_3_value", "column3")

最终输出：

+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|a      |null   |null   |
|b      |15     |null   |
|c      |null   |null   |
|d      |null   |rare   |
|e      |null   |null   |
+-------+-------+-------+

祝你好运！

Answer 2

你可以这样做：

heywtu = pd.DataFrame(columns=['column1','column2','column3'],index=np.zeros(5))
for i in first.columns:
    for j in range(first.shape[0]):
        if first[i][j] == '' :
            heywtu[i][j] = second[i][j]
heywtu['column1'] = first['column1'].values
print(heywtu.fillna(' '))

输出：

>>>     column1 column2 column3
>>> 0.0     a             
>>> 0.0     b      15       
>>> 0.0     c             
>>> 0.0     d             rare
>>> 0.0     e

如何比较两个火花数据帧并获取所有缺失值

问题描述

2 个解决方案

解决方案1
1 2022-12-24 17:27:39

解决方案2
0 2022-12-24 17:45:18

如何比较两个火花数据帧并获取所有缺失值

问题描述

2 个解决方案

解决方案1 1 2022-12-24 17:27:39

解决方案2 0 2022-12-24 17:45:18

解决方案1
1 2022-12-24 17:27:39

解决方案2
0 2022-12-24 17:45:18