How to get a column with a list of values from another column in a PySpark DataFrame?
Can someone help me understand how to create a PySpark DataFrame with all of the recipients for each sender?
For example:
Input DataFrame:
+------+---------+
|Sender|Recepient|
+------+---------+
|Alice |Bob      |
|Alice |John     |
|Alice |Mike     |
|Bob   |Tom      |
|Bob   |George   |
|George|Alice    |
|George|Bob      |
+------+---------+
Result:
+------+-----------------+
|Sender|Recepients       |
+------+-----------------+
|Alice |[Bob, John, Mike]|
|Bob   |[Tom, George]    |
|George|[Alice, Bob]     |
+------+-----------------+
I tried df.groupBy("Sender").sum("Recepients") to get one string that I could then split, but it failed with the error Aggregation function can only be applied on a numeric column.
All you need to do is groupBy the Sender column and collect the Recepient values into a list.
The complete solution is below:
# create a SparkSession and a sample DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("Alice", "Bob"),
        ("Alice", "John"),
        ("Alice", "Mike"),
        ("Bob", "Tom"),
        ("Bob", "George"),
        ("George", "Alice"),
        ("George", "Bob"),
    ],
    ("Sender", "Recepient"),
)
df.show(truncate=False)
# results below
+------+---------+
|Sender|Recepient|
+------+---------+
|Alice |Bob      |
|Alice |John     |
|Alice |Mike     |
|Bob   |Tom      |
|Bob   |George   |
|George|Alice    |
|George|Bob      |
+------+---------+
# Import functions
import pyspark.sql.functions as f
# perform a groupBy and use collect_list
df1 = df.groupBy("Sender").agg(f.collect_list("Recepient").alias("Recepients"))
df1.show(truncate=False)
# results
+------+-----------------+
|Sender|Recepients       |
+------+-----------------+
|Alice |[Bob, John, Mike]|
|Bob   |[Tom, George]    |
|George|[Alice, Bob]     |
+------+-----------------+
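One caveat worth knowing: collect_list keeps duplicate values and does not guarantee any element order, since the order depends on how rows arrive after the shuffle. If you want each sender's recipients deduplicated and in a deterministic order, a minimal variation on the same df and f defined above, using the standard collect_set and sort_array functions, would be:
# variation: unique recipients in sorted order
# collect_set drops duplicates; sort_array makes the output deterministic
df2 = df.groupBy("Sender").agg(
    f.sort_array(f.collect_set("Recepient")).alias("Recepients")
)
df2.show(truncate=False)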