How to concatenate column values in PySpark

Question

I have a dataframe. The "names" column contains underscore-separated keys that identify the columns whose values should be concatenated. I want to do this with PySpark's concat_ws(), but nothing works. I must solve it with the concat_ws() function; no pandas, etc.

The best I got was the concatenated headers, not the values in those columns. I couldn't return a list from the function to unpack it in concat_ws().

map_cols = {'a':'newA', 'b':'newB', 'c':'newC', 'd':'newD'}

@udf
def get_result(names_col):
    headers = []
    for i in names_col:
        headers.append(map_cols[i])
    return headers

df = df.withColumn('names_arr', split('names', '_')).withColumn('result', concat_ws(';', get_result(col('names_arr'))))

Input dataframe:

names   | newA | newB | newC | newD
-----------------------------------
a_b     | 1    | 2    | 7    | 8
a_b_c   | 2    | 3    | 4    | 4
a_b_c_d | 3    | 2    | 4    | 4
c_d     | 89   | 5    | 3    | 5
b_c_d   | 7    | 5    | 6    | 5


Expected output dataframe

names   | newA | newB | newC | newD | result
---------------------------------------------
a_b     | 1    | 2    | 7    | 8    | 1;2
a_b_c   | 2    | 3    | 4    | 4    | 2;3;4
a_b_c_d | 3    | 2    | 4    | 4    | 2;3;4;4
c_d     | 89   | 5    | 3    | 5    | 3;5
b_c_d   | 7    | 5    | 6    | 5    | 5;6;5

Answer 1

I am assuming that colA (newA) in the last two rows of your expected output (89 and 7) is a typo, so in the data below the columns not listed in names are null.

You can iterate over dataframe.columns and apply concat_ws.

# Data preparation skipped

# Imports
import pyspark.sql.functions as f

df.show()
+-------+----+----+----+----+
|  names|newA|newB|newC|newD|
+-------+----+----+----+----+
|    a_b|   1|   2|null|null|
|  a_b_c|   2|   3|   4|null|
|a_b_c_d|   3|   2|   4|   4|
|    c_d|null|null|   3|   5|
|  b_c_d|null|   5|   6|   5|
+-------+----+----+----+----+

Filter out the names column and concatenate the remaining columns with the ; separator. concat_ws skips null values, so only the populated columns appear in the result.

df.withColumn('result', f.concat_ws(';', *[c for c in df.columns if c!='names'])).show()
+-------+----+----+----+----+-------+
|  names|newA|newB|newC|newD| result|
+-------+----+----+----+----+-------+
|    a_b|   1|   2|null|null|    1;2|
|  a_b_c|   2|   3|   4|null|  2;3;4|
|a_b_c_d|   3|   2|   4|   4|3;2;4;4|
|    c_d|null|null|   3|   5|    3;5|
|  b_c_d|null|   5|   6|   5|  5;6;5|
+-------+----+----+----+----+-------+
