
How to get a column with a list of values from another column in PySpark

Can someone help me with an idea of how to create a PySpark DataFrame with all Recepients of each Sender?

For example:

Input DataFrame:

+------+---------+
|Sender|Recepient|
+------+---------+
|Alice | Bob     |
|Alice | John    |
|Alice | Mike    |
|Bob   | Tom     |
|Bob   | George  |
|George| Alice   |
|George| Bob     |
+------+---------+

Result:

+------+------------------+
|Sender|Recepients        |
+------+------------------+
|Alice |[Bob, John, Mike] |
|Bob   |[Tom, George]     |
|George|[Alice, Bob]      |
+------+------------------+

I tried df.groupBy("Sender").sum("Recepient") to get a string I could split, but got the error: Aggregation function can only be applied on a numeric column.

All you need to do is group by the Sender column and collect the Recepient values.

Below is the full solution.

# create a DataFrame
df = spark.createDataFrame(
    [
        ("Alice", "Bob"),
        ("Alice", "John"),
        ("Alice", "Mike"),
        ("Bob", "Tom"),
        ("Bob", "George"),
        ("George", "Alice"),
        ("George", "Bob"),
    ],
    ("Sender", "Recepient"))

df.show()

# results below
+------+---------+
|Sender|Recepient|
+------+---------+
|Alice | Bob     |
|Alice | John    |
|Alice | Mike    |
|Bob   | Tom     |
|Bob   | George  |
|George| Alice   |
|George| Bob     |
+------+---------+

# Import functions
import pyspark.sql.functions as f

# group by Sender and aggregate each group's Recepient values into an array
df1 = df.groupBy("Sender").agg(f.collect_list("Recepient").alias("Recepients"))
df1.show()
# results
+------+------------------+
|Sender|Recepients        |
+------+------------------+
|Alice |[Bob, John, Mike] |
|Bob   |[Tom, George]     |
|George|[Alice, Bob]      |
+------+------------------+
