How do I filter the column in pyspark?
I am new to pyspark. I want to compare two tables, and if the values in one of the columns do not match, I want to print that column's name in a new column. Using the link Compare two dataframes Pyspark, I was able to get this result. Now, I want to filter the new table based on the newly created column.
df1 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "UK"],
    [3, "GHI", 3000, "JPN"],
    [4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])

df2 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "CAN"],
    [3, "GHI", 3500, "JPN"],
    [4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])
from pyspark.sql.functions import col, array, when, lit, array_remove

# get conditions for all columns except id
conditions_ = [when(df1[c] != df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']

select_expr = [
    col("id"),
    *[df2[c] for c in df2.columns if c != 'id'],
    array_remove(array(*conditions_), "").alias("column_names")
]

df3 = df1.join(df2, "id").select(*select_expr)
df3.show()
DF3:
+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
|  1|  ABC|5000|     US|          []|
|  2|  DEF|4000|    CAN|   [Address]|
|  3|  GHI|3500|    JPN|       [sal]|
|  4|JKL_M|4800|    CHN| [name, sal]|
+---+-----+----+-------+------------+
This is the step where I get the error message.
df3.filter(df3.column_names!="")
Error: cannot resolve '(column_names = '')' due to data type mismatch: differing types in '(column_names = '')' (array<string> and string).
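For reference, the mismatch is visible in the DataFrame's types: column_names is an array<string>, so Spark cannot compare it to the string literal "".
df3.dtypes
# [('id', 'bigint'), ('name', 'string'), ('sal', 'bigint'),
#  ('Address', 'string'), ('column_names', 'array<string>')]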
I want the following result:
DF3:
+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
|  2|  DEF|4000|    CAN|   [Address]|
|  3|  GHI|3500|    JPN|       [sal]|
|  4|JKL_M|4800|    CHN| [name, sal]|
+---+-----+----+-------+------------+
You can create a udf to filter, and pass the relevant column name to it. I hope the code below helps:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

# simple filter function: keep rows whose column_names array is non-empty
@udf(returnType=BooleanType())
def my_filter(col1):
    return len(col1) > 0

df3.filter(my_filter(col('column_names'))).show()
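As a side note (not part of the original answer), the same filter can be written without a UDF by using the built-in size function, which avoids the overhead of shipping each row to a Python worker:
from pyspark.sql.functions import size, col

# keep only rows whose column_names array is non-empty
df3.filter(size(col('column_names')) > 0).show()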
Another way:
from pyspark.sql.functions import countDistinct, array_except, array, when, col, lit

# Do an outer join on all columns; rows that differ in any column
# no longer line up, so a mismatched id appears in more than one row
new = df1.join(df2.alias('df2'), how='outer', on=['id', 'name', 'sal', 'Address'])

# Count distinct values in each column per id
new1 = new.groupBy('id').agg(*[countDistinct(x).alias(f'{x}') for x in new.drop('id').columns])

# Using case/when: where a column has more than one distinct value,
# append its name to the new column_names column
new2 = new1.select('id', array_except(array(*[when(col(c) != 1, lit(c)) for c in new1.drop('id').columns]), array(lit(None).cast('string'))).alias('column_names'))

# Join back to df2
df2.join(new2, how='right', on='id').show()
+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
| 1| ABC|5000| US| []|
| 2| DEF|4000| CAN| [Address]|
| 3| GHI|3500| JPN| [sal]|
| 4|JKL_M|4800| CHN| [name, sal]|
+---+-----+----+-------+------------+
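To go from here to the desired output, the rows whose array is empty still need to be dropped; a minimal sketch (not part of the original answer) using the built-in size function:
from pyspark.sql.functions import size

# keep only the ids where at least one column differs
df2.join(new2, how='right', on='id').filter(size('column_names') > 0).show()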
You are getting the error because you are comparing an array type with a string. You should first convert the column_names array type to a string, and then it will work:
df3 = df3.withColumn('column_names',concat_ws(";",col("column_names")))
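After this conversion, column_names is a plain string (the array elements joined with ";", with empty arrays becoming empty strings), so the original comparison from the question now works as intended:
from pyspark.sql.functions import col

# no more type mismatch: string compared to string
df3.filter(col('column_names') != "").show()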
Use filter('array_column != array()'). See the example below, which filters out empty arrays.
spark.sparkContext.parallelize([([],), (['blah', 'bleh'],)]).toDF(['arrcol']). \
show()
# +------------+
# | arrcol|
# +------------+
# | []|
# |[blah, bleh]|
# +------------+
spark.sparkContext.parallelize([([],), (['blah', 'bleh'],)]).toDF(['arrcol']). \
filter('arrcol != array()'). \
show()
# +------------+
# | arrcol|
# +------------+
# |[blah, bleh]|
# +------------+
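Applied to the df3 built in the question (while column_names is still an array column, before any conversion to string), the same one-liner would be:
df3.filter('column_names != array()').show()
# rows with an empty column_names array are dropped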