How to concatenate the values of columns with the same name in PySpark

We have a feature request where, on demand, we pull a table from the database and perform some transformations on it. But these tables may have duplicate columns (columns with the same name), and I want to combine each set of duplicates into a single column.

For example, a request for the input table named ages:


+---+---+---+---+
|age|ids|ids|ids|
+---+---+---+---+
| 25|  1|  2|  3|
| 26|  4|  5|  6|
+---+---+---+---+


The output table should be:

+---+---------+
|age|      ids|
+---+---------+
| 25|[1, 2, 3]|
| 26|[4, 5, 6]|
+---+---------+


Next time we might get a request for the input table named names:

+----+-------+-------+
|name|company|company|
+----+-------+-------+
| abc|      a|      b|
| xyc|      c|      d|
+----+-------+-------+

The output table should be:

+----+-------+
|name|company|
+----+-------+
| abc| [a, b]|
| xyc| [c, d]|
+----+-------+

So basically I need to find the columns with the same name and then merge their values.

You can convert the Spark DataFrame into a pandas DataFrame, perform the necessary operations, and convert it back to a Spark DataFrame.

I have added comments for clarity.

Using Pandas:

import pandas as pd
from collections import Counter

pd_df = spark_df.toPandas()  # convert the Spark DataFrame to a pandas DataFrame

pd_df.head()  # quick inspection of the first rows (useful in a notebook)

def concatDuplicateColumns(df):
    # Collect every column name that appears more than once.
    counts = Counter(df.columns)
    duplicate_cols = [col for col, n in counts.items() if n > 1]

    # For each duplicated name, gather the row-wise values into lists.
    final_dict = {col: [] for col in duplicate_cols}
    for col in duplicate_cols:
        for ind in df.index.tolist():
            # With duplicate labels, df.loc[ind, col] returns a Series
            # holding the value from every same-named column.
            final_dict[col].append(df.loc[ind, col].tolist())

    # Drop all copies of each duplicated column, then add back a single
    # column containing the collected lists.
    df.drop(duplicate_cols, axis=1, inplace=True)
    for col in duplicate_cols:
        df[col] = final_dict[col]
    return df

final_df = concatDuplicateColumns(pd_df)

spark_df = spark.createDataFrame(final_df)  # back to a Spark DataFrame

spark_df.show()
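
If you would rather avoid the round trip through pandas, the same merge can be done entirely in Spark: rename the columns positionally so every name is unique, then collect the same-named columns into an array. A minimal sketch, assuming the duplicated columns hold values of a compatible type (the helper name and the __n suffix scheme here are my own, not part of any API):

Using PySpark:

from collections import Counter, defaultdict
from pyspark.sql import functions as F

def concat_duplicate_columns_spark(df):
    # Rename columns positionally so every name is unique,
    # e.g. the second "ids" becomes "ids__1".
    seen = Counter()
    unique_names = []
    for name in df.columns:
        unique_names.append(f"{name}__{seen[name]}" if seen[name] else name)
        seen[name] += 1
    renamed = df.toDF(*unique_names)

    # Group the uniquified names back under their original name.
    groups = defaultdict(list)
    for original, unique in zip(df.columns, unique_names):
        groups[original].append(unique)

    # Duplicated names become a single array column; the rest pass through.
    exprs = [F.array(*cols).alias(name) if len(cols) > 1
             else F.col(cols[0]).alias(name)
             for name, cols in groups.items()]
    return renamed.select(*exprs)

merged = concat_duplicate_columns_spark(spark_df)
merged.show()

This keeps the work lazy and distributed, which matters once the tables are too large to collect into pandas.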
