How to concatenate the values of columns with the same name in PySpark

We have a feature request where, on demand, we pull a table from the database and perform some transformations on it. But these tables may have duplicate columns (columns with the same name), and I want to combine each set of duplicates into a single column.

For example, a request for the input table named ages:


+---+---+---+---+
|age|ids|ids|ids|
+---+---+---+---+
| 25|  1|  2|  3|
| 26|  4|  5|  6|
+---+---+---+---+


The output table should be:

+---+---------+
|age|      ids|
+---+---------+
| 25|[1, 2, 3]|
| 26|[4, 5, 6]|
+---+---------+


Next time we might get a request for the input table named names:

+----+-------+-------+
|name|company|company|
+----+-------+-------+
| abc|      a|      b|
| xyc|      c|      d|
+----+-------+-------+

The output table should be:

+----+-------+
|name|company|
+----+-------+
| abc| [a, b]|
| xyc| [c, d]|
+----+-------+

So basically I need to find the columns with the same name and then merge their values.

You can convert the Spark DataFrame into a pandas DataFrame, perform the necessary operations, and convert it back to a Spark DataFrame.

I have added comments for clarity.

Using Pandas:

import pandas as pd
from collections import Counter

pd_df = spark_df.toPandas()  # convert the Spark DataFrame to a pandas DataFrame

pd_df.head()  # quick inspection of the first rows (useful in a notebook)

def concatDuplicateColumns(df):
    # Collect every column name that appears more than once.
    counts = Counter(df.columns)
    duplicate_cols = [col for col, n in counts.items() if n > 1]

    # For each duplicated name, gather the row-wise values into lists.
    final_dict = {col: [] for col in duplicate_cols}
    for col in duplicate_cols:
        for ind in df.index.tolist():
            # With duplicate labels, df.loc[ind, col] returns a Series
            # holding the value from every same-named column.
            final_dict[col].append(df.loc[ind, col].tolist())

    # Drop all copies of each duplicated column, then add back a single
    # column containing the collected lists.
    df.drop(duplicate_cols, axis=1, inplace=True)
    for col in duplicate_cols:
        df[col] = final_dict[col]
    return df

final_df = concatDuplicateColumns(pd_df)

spark_df = spark.createDataFrame(final_df)  # back to a Spark DataFrame

spark_df.show()
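
If you would rather avoid the round trip through pandas, the same merge can be done entirely in Spark: rename the columns positionally so every name is unique, then collect the same-named columns into an array. A minimal sketch, assuming the duplicated columns hold values of a compatible type (the helper name and the __n suffix scheme here are my own, not part of any API):

Using PySpark:

from collections import Counter, defaultdict
from pyspark.sql import functions as F

def concat_duplicate_columns_spark(df):
    # Rename columns positionally so every name is unique,
    # e.g. the second "ids" becomes "ids__1".
    seen = Counter()
    unique_names = []
    for name in df.columns:
        unique_names.append(f"{name}__{seen[name]}" if seen[name] else name)
        seen[name] += 1
    renamed = df.toDF(*unique_names)

    # Group the uniquified names back under their original name.
    groups = defaultdict(list)
    for original, unique in zip(df.columns, unique_names):
        groups[original].append(unique)

    # Duplicated names become a single array column; the rest pass through.
    exprs = [F.array(*cols).alias(name) if len(cols) > 1
             else F.col(cols[0]).alias(name)
             for name, cols in groups.items()]
    return renamed.select(*exprs)

merged = concat_duplicate_columns_spark(spark_df)
merged.show()

This keeps the work lazy and distributed, which matters once the tables are too large to collect into pandas.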
