Can someone help me with an idea of how to create a PySpark DataFrame that lists all recipients of each sender?
For example:
Input DataFrame:
+------+---------+
|Sender|Recepient|
+------+---------+
|Alice | Bob |
|Alice | John |
|Alice | Mike |
|Bob | Tom |
|Bob | George |
|George| Alice |
|George| Bob |
+------+---------+
Result:
+------+------------------+
|Sender|Recepients |
+------+------------------+
|Alice |[Bob, John, Mike] |
|Bob |[Tom, George] |
|George|[Alice, Bob] |
+------+------------------+
I tried df.groupBy("Sender").sum("Recepient")
to get a string I could split, but got the error: Aggregation function can only be applied on a numeric column.
All you need to do is group by the Sender
column and collect the Recepient
values with collect_list.
Below is the full solution.
# create a DataFrame
df = spark.createDataFrame(
    [
        ("Alice", "Bob"),
        ("Alice", "John"),
        ("Alice", "Mike"),
        ("Bob", "Tom"),
        ("Bob", "George"),
        ("George", "Alice"),
        ("George", "Bob"),
    ],
    ("Sender", "Recepient"))
df.show()
# results below
+------+---------+
|Sender|Recepient|
+------+---------+
|Alice | Bob |
|Alice | John |
|Alice | Mike |
|Bob | Tom |
|Bob | George |
|George| Alice |
|George| Bob |
+------+---------+
# Import functions
import pyspark.sql.functions as f
# group by Sender and collect the recipients into a list
df1 = df.groupBy("Sender").agg(f.collect_list('Recepient').alias('Recepients'))
df1.show()
# results
+------+------------------+
|Sender|Recepients |
+------+------------------+
|Alice |[Bob, John, Mike] |
|Bob |[Tom, George] |
|George|[Alice, Bob] |
+------+------------------+