How to filter data from a dataframe using pyspark
I have a table named mytable available as a DataFrame; it is shown below.
+---+----+----+----+
|  x|   y|   z|   w|
+---+----+----+----+
|  1|   a|null|null|
|  1|null|   b|null|
|  1|null|null|   c|
|  2|   d|null|null|
|  2|null|   e|null|
|  2|null|null|   f|
+---+----+----+----+
I want to group by column x and concatenate the values of columns y, z and w. The result should look like this:
+---+------+
|  x|result|
+---+------+
|  1| a b c|
|  2| d e f|
+---+------+
Hope this helps!
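from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws, collect_list, concat, coalesce, lit

# create (or reuse) a session; in the pyspark shell `spark` and `sc` already exist
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# sample data
df = sc.parallelize([
    [1, 'a', None, None],
    [1, None, 'b', None],
    [1, None, None, 'c'],
    [2, 'd', None, None],
    [2, None, 'e', None],
    [2, None, None, 'f']]).\
    toDF(('x', 'y', 'z', 'w'))
df.show()

# per row: replace nulls with "" and concatenate y, z, w into one string;
# per group: collect those strings and join them with a space
# (collect_list does not guarantee the order of the collected values)
result_df = df.groupby("x").\
    agg(concat_ws(' ', collect_list(concat(*[coalesce(c, lit("")) for c in df.columns[1:]]))).
        alias('result'))
result_df.show()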
Output is:
+---+------+
|  x|result|
+---+------+
|  1| a b c|
|  2| d e f|
+---+------+
Sample input:
+---+----+----+----+
|  x|   y|   z|   w|
+---+----+----+----+
|  1|   a|null|null|
|  1|null|   b|null|
|  1|null|null|   c|
|  2|   d|null|null|
|  2|null|   e|null|
|  2|null|null|   f|
+---+----+----+----+
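If you want to sanity-check the expected result without a Spark session, the group-and-concatenate logic can be modeled in plain Python. This is only a rough sketch of the intended outcome, not of Spark's exact semantics: the helper name concat_by_key is made up for illustration, None stands in for null, nulls are skipped rather than replaced with empty strings, and values are joined in input order, which Spark's collect_list does not promise.

```python
from collections import defaultdict

# Same rows as the sample data; None stands in for Spark's null.
rows = [
    (1, 'a', None, None),
    (1, None, 'b', None),
    (1, None, None, 'c'),
    (2, 'd', None, None),
    (2, None, 'e', None),
    (2, None, None, 'f'),
]

def concat_by_key(rows, sep=' '):
    """Hypothetical helper: group rows by their first field and join
    the non-null values of the remaining fields with `sep`."""
    grouped = defaultdict(list)
    for key, *values in rows:
        # keep only the non-null values, in the order they appear
        grouped[key].extend(v for v in values if v is not None)
    return {key: sep.join(vals) for key, vals in grouped.items()}

result = concat_by_key(rows)
print(result)  # {1: 'a b c', 2: 'd e f'}
```

Comparing this dictionary with the rows returned by result_df.collect() is a quick way to confirm the aggregation on a small sample.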