
display unique values between two dataframe columns - pyspark

Say I have an "ID" column in each of two dataframes; I want to display the IDs from DF1 that don't exist in DF2.

I don't know if I should use join, merge, or isin.

cond = [df.name != df3.name]
df.join(df3, cond, 'outer').select(df.name, df3.age).collect()

I'm not sure if changing the condition will give me the result.

In pyspark, you can use a leftanti join:

>>> df1 = spark.createDataFrame([(0,'val1'),(1,'val2'),(4,'val4')],['id','val'])    
>>> df1.show()
+---+----+
| id| val|
+---+----+
|  0|val1|
|  1|val2|
|  4|val4|
+---+----+

>>> df2 = spark.createDataFrame([(0,'val1'),(1,'val2'),(3,'val3'),(2,'val2')],['id','val'])
>>> df2.show()
+---+----+
| id| val|
+---+----+
|  0|val1|
|  1|val2|
|  3|val3|
|  2|val2|
+---+----+

>>> df1.join(df2,'id','leftanti').show()
+---+----+
| id| val|
+---+----+
|  4|val4|
+---+----+

Similarly,

>>> df2.join(df1,'id','leftanti').show()
+---+----+
| id| val|
+---+----+
|  3|val3|
|  2|val2|
+---+----+

In pandas, use isin and filter with ~df1['id'].isin(...) to compare the dataframes.

df1:

    id name
 0   1    a
 1   2    b
 2   3    c
 3   4    d

df2:

    id name
 0   1   aa
 1   5   bb
 2   2   cc
 3  10   dd

result = df1.loc[~df1['id'].isin(df2['id'])]

result

    id name
 2   3    c
 3   4    d

Hope this answer is helpful.
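The same pandas comparison can also be sketched with `merge` and its `indicator` flag, which marks each row as matched in both frames or only in the left one; a minimal sketch using the data from the answer above:

```python
import pandas as pd

# Rebuild the two example frames from the answer above.
df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'id': [1, 5, 2, 10], 'name': ['aa', 'bb', 'cc', 'dd']})

# Left merge on id; the _merge column marks each row 'both' or 'left_only'.
merged = df1.merge(df2[['id']], on='id', how='left', indicator=True)

# Keep only the df1 rows whose id never matched a df2 id.
result = df1[merged['_merge'] == 'left_only']
print(result)
```

This gives the same rows as the `~isin` filter; `indicator=True` is mainly useful when you also want to inspect which rows did match.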

Here's a code snippet which uses isin from pyspark.sql to filter out the ids that you are not interested in. A plain Python list of the ids to be filtered is built first.

    from __future__ import print_function
    from pyspark.sql import SparkSession

    spark_session = SparkSession \
        .builder \
        .appName("test_isin") \
        .getOrCreate()

    dict1 = [[1,'a'], [2,'b'], [3,'c'], [4,'d']]
    dict2 = [[1, 'aa'], [5,'bb'], [2, 'cc'], [10, 'dd']]

    df1 = spark_session.createDataFrame(dict1, ["id", "name"])
    df2 = spark_session.createDataFrame(dict2, ["id", "name"])

    df2_id = df2.select(df2.id).collect()

    # build a plain Python list of the ids to exclude
    # (a list comprehension is used here: in Python 3, map is lazy, so
    # appending inside a mapped lambda never runs; isin also needs plain
    # id values, not Row objects)
    ids_to_be_filtered = [each.id for each in df2_id]

    result = df1[~df1.id.isin(ids_to_be_filtered)]

    result.show()

Also, here's the link to the documentation: https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isin

Please don't forget to let me know if it solved your problem :)
