
How to filter data from a dataframe using pyspark

I have a table named mytable available as a dataframe, and below is the table:

+---+----+----+----+
|  x|   y|   z|   w|
+---+----+----+----+
|  1|   a|null|null|
|  1|null|   b|null|
|  1|null|null|   c|
|  2|   d|null|null|
|  2|null|   e|null|
|  2|null|null|   f|
+---+----+----+----+

I want a result where we group by column x and concatenate the values of columns y, z, and w. The result looks as below.

+---+------+
|  x|result|
+---+------+
|  1| a b c|
|  2| d e f|
+---+------+

Hope this helps!

from pyspark.sql.functions import concat_ws, collect_list, concat, coalesce, lit

#sample data
df = sc.parallelize([
    [1, 'a', None, None],
    [1, None, 'b', None],
    [1, None, None, 'c'],
    [2, 'd', None, None],
    [2, None, 'e', None],
    [2, None, None, 'f']]).\
    toDF(('x', 'y', 'z', 'w'))
df.show()

result_df = df.groupby("x").\
               agg(concat_ws(' ', collect_list(concat(*[coalesce(c, lit("")) for c in df.columns[1:]]))).
                   alias('result'))
result_df.show()

Output is:

+---+------+
|  x|result|
+---+------+
|  1| a b c|
|  2| d e f|
+---+------+

Sample input:

+---+----+----+----+
|  x|   y|   z|   w|
+---+----+----+----+
|  1|   a|null|null|
|  1|null|   b|null|
|  1|null|null|   c|
|  2|   d|null|null|
|  2|null|   e|null|
|  2|null|null|   f|
+---+----+----+----+
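
For comparison, here is a minimal, self-contained sketch of the same idea (an assumed alternative, not part of the original answer): it builds the sample data with SparkSession.createDataFrame instead of relying on a pre-existing sc, and, since each input row carries exactly one non-null value among y, z, and w, it coalesces those columns into a single value before collecting per group. The column name val and the app name are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, collect_list, concat_ws

spark = SparkSession.builder.appName("group-concat-example").getOrCreate()

# same sample data, built via createDataFrame
df = spark.createDataFrame(
    [(1, 'a', None, None),
     (1, None, 'b', None),
     (1, None, None, 'c'),
     (2, 'd', None, None),
     (2, None, 'e', None),
     (2, None, None, 'f')],
    ['x', 'y', 'z', 'w'])

# coalesce(y, z, w) picks the single non-null value in each row;
# collect_list gathers those values per group, concat_ws joins them with spaces
result_df = (df
             .withColumn('val', coalesce('y', 'z', 'w'))
             .groupBy('x')
             .agg(concat_ws(' ', collect_list('val')).alias('result')))
result_df.show()

Note that, as with the original code, collect_list does not guarantee the order of collected values within a group; both versions happen to produce "a b c" and "d e f" here, but the ordering is not deterministic in general.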
