
Compare two dataframes Pyspark

I'm trying to compare two data frames that have the same number of columns, i.e. 4 columns, with id as the key column in both data frames.

df1 = spark.read.csv("/path/to/data1.csv")
df2 = spark.read.csv("/path/to/data2.csv")

Now I want to append a new column to DF2, i.e. column_names, which is the list of the columns whose values differ from df1.

df2.withColumn("column_names",udf())

DF1:

+---+-----+----+-------+
| id| name| sal|Address|
+---+-----+----+-------+
|  1|  ABC|5000|     US|
|  2|  DEF|4000|     UK|
|  3|  GHI|3000|    JPN|
|  4|  JKL|4500|    CHN|
+---+-----+----+-------+

DF2:

+---+-----+----+-------+
| id| name| sal|Address|
+---+-----+----+-------+
|  1|  ABC|5000|     US|
|  2|  DEF|4000|    CAN|
|  3|  GHI|3500|    JPN|
|  4|JKL_M|4800|    CHN|
+---+-----+----+-------+

Now I want DF3.

DF3:

+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
|  1|  ABC|5000|     US|          []|
|  2|  DEF|4000|    CAN|   [address]|
|  3|  GHI|3500|    JPN|       [sal]|
|  4|JKL_M|4800|    CHN|  [name,sal]|
+---+-----+----+-------+------------+

I saw this SO question: How to compare two dataframe and print columns that are different in scala. I tried that, but the result is different.

I'm thinking of going with a UDF by passing a row from each dataframe to the UDF, comparing column by column, and returning a column list. However, for that both data frames would have to be in sorted order so that rows with the same id are sent to the UDF together. Sorting is a costly operation here. Any solution?

Assuming that we can use id to join these two datasets, I don't think there is a need for a UDF. This can be solved just by using an inner join and the array and array_remove functions, among others.

First let's create the two datasets:

df1 = spark.createDataFrame([
  [1, "ABC", 5000, "US"],
  [2, "DEF", 4000, "UK"],
  [3, "GHI", 3000, "JPN"],
  [4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])

df2 = spark.createDataFrame([
  [1, "ABC", 5000, "US"],
  [2, "DEF", 4000, "CAN"],
  [3, "GHI", 3500, "JPN"],
  [4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])

First we do an inner join between the two datasets, then we generate the condition df1[col] != df2[col] for each column except id. When the columns aren't equal we return the column name, otherwise an empty string. The list of conditions becomes the items of an array, from which we finally remove the empty items:

from pyspark.sql.functions import col, array, when, lit, array_remove

# get conditions for all columns except id
conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']

select_expr =[
                col("id"), 
                *[df2[c] for c in df2.columns if c != 'id'], 
                array_remove(array(*conditions_), "").alias("column_names")
]

df1.join(df2, "id").select(*select_expr).show()

# +---+-----+----+-------+------------+
# | id| name| sal|Address|column_names|
# +---+-----+----+-------+------------+
# |  1|  ABC|5000|     US|          []|
# |  3|  GHI|3500|    JPN|       [sal]|
# |  2|  DEF|4000|    CAN|   [Address]|
# |  4|JKL_M|4800|    CHN| [name, sal]|
# +---+-----+----+-------+------------+
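One caveat, added here as a side note rather than part of the original answer: df1[c] != df2[c] evaluates to null when either side is null, so a row where only one of the two values is null would not be flagged. If that matters, the conditions could be built null-safely with eqNullSafe; a minimal sketch, assuming the same df1 and df2 as above:

from pyspark.sql.functions import col, array, when, lit, array_remove

# null-safe version: a column is flagged whenever the two values differ,
# including when exactly one of them is null
conditions_ = [
    when(~df1[c].eqNullSafe(df2[c]), lit(c)).otherwise("")
    for c in df1.columns if c != 'id'
]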

Python: PySpark version of my previous Scala code.

import pyspark.sql.functions as f

df1 = spark.read.option("header", "true").csv("test1.csv")
df2 = spark.read.option("header", "true").csv("test2.csv")

columns = df1.columns
df3 = df1.alias("d1").join(df2.alias("d2"), f.col("d1.id") == f.col("d2.id"), "left")

for name in columns:
    df3 = df3.withColumn(name + "_temp", f.when(f.col("d1." + name) != f.col("d2." + name), f.lit(name)))


df3.withColumn("column_names", f.concat_ws(",", *map(lambda name: f.col(name + "_temp"), columns))).select("d1.*", "column_names").show()

Scala: Here is my best approach for your problem.

val df1 = spark.read.option("header", "true").csv("test1.csv")
val df2 = spark.read.option("header", "true").csv("test2.csv")

val columns = df1.columns
val df3 = df1.alias("d1").join(df2.alias("d2"), col("d1.id") === col("d2.id"), "left")

columns.foldLeft(df3) {(df, name) => df.withColumn(name + "_temp", when(col("d1." + name) =!= col("d2." + name), lit(name)))}
  .withColumn("column_names", concat_ws(",", columns.map(name => col(name + "_temp")): _*))
  .show(false)

First, I join the two dataframes into df3 and use the columns from df1. Then I fold left over df3, adding one temp column per column that holds the column name when the df1 and df2 values for that column differ (rows are matched on id), and null otherwise.

After that, concat_ws joins those temp columns; the nulls drop out and only the differing column names are left.

+---+----+----+-------+------------+
|id |name|sal |Address|column_names|
+---+----+----+-------+------------+
|1  |ABC |5000|US     |            |
|2  |DEF |4000|UK     |Address     |
|3  |GHI |3000|JPN    |sal         |
|4  |JKL |4500|CHN    |name,sal    |
+---+----+----+-------+------------+

The only thing different from your expected result is that the output is not a list but a string.
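If an array is needed instead of the comma-separated string, the PySpark version above could split the result afterwards. A minimal sketch, assuming the df3 and columns defined in that snippet; the result_df name is introduced here purely for illustration:

import pyspark.sql.functions as f

# build the comma-separated string as above, then split it into an array;
# array_remove drops the empty item produced for rows with no differences
result_df = df3.withColumn(
    "column_names",
    f.concat_ws(",", *[f.col(name + "_temp") for name in columns])
).select("d1.*", "column_names")

result_df = result_df.withColumn(
    "column_names",
    f.array_remove(f.split(f.col("column_names"), ","), "")
)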

P.S. I forgot to use PySpark, but this is plain (Scala) Spark, sorry.

Here is your solution with a UDF. I have renamed the first dataframe's columns dynamically so that they are not ambiguous during the check. Go through the code below and let me know in case of any concerns.

>>> from pyspark.sql.functions import *
>>> df.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
|  1| ABC|5000|     US|
|  2| DEF|4000|     UK|
|  3| GHI|3000|    JPN|
|  4| JKL|4500|    CHN|
+---+----+----+-------+

>>> df1.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
|  1| ABC|5000|     US|
|  2| DEF|4000|    CAN|
|  3| GHI|3500|    JPN|
|  4|JKLM|4800|    CHN|
+---+----+----+-------+

>>> df2 = df.select([col(c).alias("x_"+c) for c in df.columns])
>>> df3 = df1.join(df2, col("id") == col("x_id"), "left")

# UDF declaration

>>> def CheckMatch(Column,r):
...     check=''
...     ColList=Column.split(",")
...     for cc in ColList:
...             if(r[cc] != r["x_" + cc]):
...                     check=check + "," + cc
...     return check.replace(',','',1).split(",")

>>> from pyspark.sql.types import ArrayType, StringType
>>> CheckMatchUDF = udf(CheckMatch, ArrayType(StringType()))

# final columns required for the select
>>> finalCol = df1.columns
>>> finalCol.insert(len(finalCol), "column_names")

>>> df3.withColumn("column_names", CheckMatchUDF(lit(','.join(df1.columns)),struct([df3[x] for x in df3.columns])))
       .select(finalCol)
       .show()
+---+----+----+-------+------------+
| id|name| sal|Address|column_names|
+---+----+----+-------+------------+
|  1| ABC|5000|     US|          []|
|  2| DEF|4000|    CAN|   [Address]|
|  3| GHI|3500|    JPN|       [sal]|
|  4|JKLM|4800|    CHN| [name, sal]|
+---+----+----+-------+------------+

You can get that query built for you in PySpark and Scala by the spark-extension package. It provides a diff transformation that does exactly that.

from gresearch.spark.diff import *

options = DiffOptions().with_change_column('changes')
df1.diff_with_options(df2, options, 'id').show()
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|diff|    changes| id|left_name|right_name|left_sal|right_sal|left_Address|right_Address|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|   N|         []|  1|      ABC|       ABC|    5000|     5000|          US|           US|
|   C|  [Address]|  2|      DEF|       DEF|    4000|     4000|          UK|          CAN|
|   C|      [sal]|  3|      GHI|       GHI|    3000|     3500|         JPN|          JPN|
|   C|[name, sal]|  4|      JKL|     JKL_M|    4500|     4800|         CHN|          CHN|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+

While this is a simple example, diffing DataFrames can become complicated when wide schemas, insertions, deletions and null values are involved. That package is well-tested, so you don't have to worry about getting the query right yourself.
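One usage note, which is my own assumption rather than part of the answer above: the spark-extension package has to be on the Spark classpath before the gresearch.spark.diff import works. One way is to add it when building the session; the Maven coordinate and version below are examples and should be checked against the spark-extension project README:

from pyspark.sql import SparkSession

# coordinate/version are assumptions; pick the artifact matching your
# Spark and Scala versions per the spark-extension README
spark = SparkSession.builder \
    .config("spark.jars.packages", "uk.co.gresearch.spark:spark-extension_2.12:2.1.0") \
    .getOrCreate()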

Below is how it works in PySpark:

from gresearch.spark.diff import *

options = DiffOptions().with_change_column('changes')
df1.diff_with_options(df2, options, 'id').show()

+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|diff|    changes| id|left_name|right_name|left_sal|right_sal|left_Address|right_Address|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|   N|         []|  1|      ABC|       ABC|    5000|     5000|          US|           US|
|   C|  [Address]|  2|      DEF|       DEF|    4000|     4000|          UK|          CAN|
|   C|      [sal]|  3|      GHI|       GHI|    3000|     3500|         JPN|          JPN|
|   C|[name, sal]|  4|      JKL|     JKL_M|    4500|     4800|         CHN|          CHN|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
