
How do I filter the column in pyspark?

I am new to pyspark. I want to compare two tables. If a value in one of the columns does not match, I want to print out that column name in a new column. Using the Compare two dataframes Pyspark link, I was able to get this result. Now I want to filter the new table based on the newly created column.

df1 = spark.createDataFrame([
  [1, "ABC", 5000, "US"],
  [2, "DEF", 4000, "UK"],
  [3, "GHI", 3000, "JPN"],
  [4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])

df2 = spark.createDataFrame([
  [1, "ABC", 5000, "US"],
  [2, "DEF", 4000, "CAN"],
  [3, "GHI", 3500, "JPN"],
  [4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])


from pyspark.sql.functions import *
#from pyspark.sql.functions import col, array, when, array_remove

# get conditions for all columns except id
conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']

select_expr = [
    col("id"),
    *[df2[c] for c in df2.columns if c != 'id'],
    array_remove(array(*conditions_), "").alias("column_names")
]

df3 = df1.join(df2, "id").select(*select_expr)
df3.show()

DF3:

+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
|  1|  ABC|5000|     US|          []|
|  2|  DEF|4000|    CAN|   [Address]|
|  3|  GHI|3500|    JPN|       [sal]|
|  4|JKL_M|4800|    CHN| [name, sal]|
+---+-----+----+-------+------------+

This is the step where I get the error message.

df3.filter(df3.column_names!="")

Error: cannot resolve '(column_names = '')' due to data type mismatch: differing types in '(column_names = '')' (array<string> and string).

I want the following result:

DF3:

+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
|  2|  DEF|4000|    CAN|   [Address]|
|  3|  GHI|3500|    JPN|       [sal]|
|  4|JKL_M|4800|    CHN| [name, sal]|
+---+-----+----+-------+------------+

You can create a udf for the filter and pass the relevant column to it; I hope the code below helps.

from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

# simple filter function: True when the column_names array is non-empty
@udf(returnType=BooleanType())
def my_filter(col1):
  return len(col1) > 0

df3.filter(my_filter(col('column_names'))).show()
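
If you would rather avoid a Python UDF, a built-in alternative is size() from pyspark.sql.functions, which returns the number of elements in an array column. This is a minimal sketch, assuming the df3 built in the question:

from pyspark.sql.functions import size, col

# keep only rows whose column_names array is non-empty
df3.filter(size(col('column_names')) > 0).show()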

Another way:

from pyspark.sql.functions import countDistinct, when, col, lit, array, array_except

# Do an outer join on all columns
new = df1.join(df2.alias('df2'), how='outer', on=['id', 'name', 'sal', 'Address'])

# Count distinct values in each column per id
new1 = new.groupBy('id').agg(*[countDistinct(x).alias(x) for x in new.drop('id').columns])

# Using case when: where there is more than one distinct value, append the column name to a new array column
new2 = new1.select(
    'id',
    array_except(
        array(*[when(col(c) != 1, lit(c)) for c in new1.drop('id').columns]),
        array(lit(None).cast('string'))
    ).alias('column_names')
)

# Join back to df2
df2.join(new2, how='right', on='id').show()


+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
|  1|  ABC|5000|     US|          []|
|  2|  DEF|4000|    CAN|   [Address]|
|  3|  GHI|3500|    JPN|       [sal]|
|  4|JKL_M|4800|    CHN| [name, sal]|
+---+-----+----+-------+------------+
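
The row with an empty array (id 1) is still present at this point. To keep only the mismatching rows, as in the desired result, the empty arrays can be filtered out afterwards; a minimal sketch, assuming the join above is assigned to a variable first:

from pyspark.sql.functions import size

result = df2.join(new2, how='right', on='id')
# drop rows whose column_names array is empty
result.filter(size('column_names') > 0).show()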

You get the error because you are comparing an array type with a string. You should first convert the column_names array type to a string, and then it will work:

# flatten the array into a ";"-separated string so it can be compared to ""
df3 = df3.withColumn('column_names', concat_ws(";", col("column_names")))
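
After the conversion, the original string comparison no longer hits the type mismatch; a quick sketch of the follow-up filter (rows whose array was empty become the empty string):

df3.filter(df3.column_names != "").show()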

Use filter('array_column != array()'). See the example below, which filters out empty arrays.

spark.sparkContext.parallelize([([],), (['blah', 'bleh'],)]).toDF(['arrcol']). \
    show()

# +------------+
# |      arrcol|
# +------------+
# |          []|
# |[blah, bleh]|
# +------------+

spark.sparkContext.parallelize([([],), (['blah', 'bleh'],)]).toDF(['arrcol']). \
    filter('arrcol != array()'). \
    show()

# +------------+
# |      arrcol|
# +------------+
# |[blah, bleh]|
# +------------+
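
Applied to the question's df3 (whose column_names is likewise an array of strings), the same expression should behave identically; a minimal sketch:

# keep only rows where at least one column differs
df3.filter('column_names != array()').show()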
