
How to get a column with a list of values from another column in PySpark

Can someone help me with an idea of how to create a PySpark DataFrame with all Recepients of each Sender?

For example:

Input DataFrame:

+------+---------+
|Sender|Recepient|
+------+---------+
|Alice | Bob     |
|Alice | John    |
|Alice | Mike    |
|Bob   | Tom     |
|Bob   | George  |
|George| Alice   |
|George| Bob     |
+------+---------+

Result:

+------+------------------+
|Sender|Recepients        |
+------+------------------+
|Alice |[Bob, John, Mike] |
|Bob   |[Tom, George]     |
|George|[Alice, Bob]      |
+------+------------------+

I tried df.groupBy("Sender").sum("Recepient") to get a string I could split, but got the error: Aggregation function can only be applied on a numeric column.

All you need to do is group by the Sender column and collect the Recepient values.

Below is the full solution.

# create a DataFrame
df = spark.createDataFrame(
    [
        ("Alice", "Bob"),
        ("Alice", "John"),
        ("Alice", "Mike"),
        ("Bob", "Tom"),
        ("Bob", "George"),
        ("George", "Alice"),
        ("George", "Bob"),
    ],
    ("Sender", "Recepient"))

df.show()

# results below
+------+---------+
|Sender|Recepient|
+------+---------+
|Alice | Bob     |
|Alice | John    |
|Alice | Mike    |
|Bob   | Tom     |
|Bob   | George  |
|George| Alice   |
|George| Bob     |
+------+---------+

# Import functions
import pyspark.sql.functions as f

# group by Sender and aggregate each group's Recepient values into an array
df1 = df.groupBy("Sender").agg(f.collect_list("Recepient").alias("Recepients"))
df1.show()
# results
+------+------------------+
|Sender|Recepients        |
+------+------------------+
|Alice |[Bob, John, Mike] |
|Bob   |[Tom, George]     |
|George|[Alice, Bob]      |
+------+------------------+
