
How to concat values of columns with the same name in PySpark

We have a feature request where we want to pull a table from the database on demand and perform some transformations on it. But these tables may have duplicate columns (columns with the same name), and I want to combine each set of same-named columns into a single column.

For example:

Request for an input table named ages:


+---+-----+-----+-----+
|age| ids | ids | ids |
+---+-----+-----+-----+
| 25|   1 |   2 |   3 |
| 26|   4 |   5 |   6 |
+---+-----+-----+-----+


The output table is:

+---+-----------+
|age|    ids    |
+---+-----------+
| 25| [1, 2, 3] |
| 26| [4, 5, 6] |
+---+-----------+


The next time, we might get a request for an input table named names:

+----+---------+---------+
|name| company | company |
+----+---------+---------+
| abc|    a    |    b    |
| xyc|    c    |    d    |
+----+---------+---------+

The output table should be:

+----+---------+
|name| company |
+----+---------+
| abc|  [a, b] |
| xyc|  [c, d] |
+----+---------+

So basically I need to find the columns with the same name and then merge the values in them.

You can convert the Spark DataFrame into a pandas DataFrame, perform the necessary operations, and convert it back to a Spark DataFrame.

I have added the necessary comments for clarity.

Using pandas:

import pandas as pd
from collections import Counter

pd_df = spark_df.toPandas()  # convert the Spark DataFrame to a pandas DataFrame

pd_df.head()

def concatDuplicateColumns(df):
    # find the column names that appear more than once
    counts = Counter(df.columns)
    duplicate_cols = [col for col, n in counts.items() if n > 1]

    # for each duplicated name, collect the row-wise values of all
    # same-named columns into a list
    final_dict = {col: [] for col in duplicate_cols}
    for col in duplicate_cols:
        for ind in df.index.tolist():
            final_dict[col].append(df.loc[ind, col].tolist())

    # drop every column carrying a duplicated name, then re-add each
    # name as a single column holding the collected lists
    df.drop(duplicate_cols, axis=1, inplace=True)
    for col in duplicate_cols:
        df[col] = final_dict[col]
    return df

final_df = concatDuplicateColumns(pd_df)

spark_df = spark.createDataFrame(final_df)

spark_df.show()
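
One caveat with this approach: toPandas() collects the entire table onto the driver, so it only works when the data fits in memory. For larger tables, the same merge can be done without leaving Spark. The sketch below is one possible way to do it (concat_duplicate_columns is a hypothetical helper, not from any library): because a reference to a duplicated column name is ambiguous in Spark, it first renames every column to a unique temporary name with toDF, then keeps unique columns as-is and wraps each group of same-named columns in an array.

from collections import Counter
from pyspark.sql import functions as F

def concat_duplicate_columns(df):  # hypothetical helper, a sketch only
    # Duplicated names are ambiguous references in Spark, so give every
    # column a unique temporary name first.
    tmp_names = [f"{name}__{i}" for i, name in enumerate(df.columns)]
    renamed = df.toDF(*tmp_names)

    counts = Counter(df.columns)
    exprs, handled = [], set()
    for i, name in enumerate(df.columns):
        if counts[name] == 1:
            # unique column: keep it under its original name
            exprs.append(F.col(tmp_names[i]).alias(name))
        elif name not in handled:
            # duplicated column: wrap all same-named columns (which must
            # share a data type) in a single array column
            handled.add(name)
            dups = [t for t, n in zip(tmp_names, df.columns) if n == name]
            exprs.append(F.array(*[F.col(t) for t in dups]).alias(name))
    return renamed.select(*exprs)

spark_df = concat_duplicate_columns(spark_df)
spark_df.show()

Applied to the ages table above, this should produce the same [1, 2, 3] arrays without materializing the data on the driver.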
