
How to get column with list of values from another column in Pyspark

Can someone help me understand how to create a PySpark DataFrame with all recipients for each sender?

For example:

Input DataFrame:

+------+---------+
|Sender|Recepient|
+------+---------+
|Alice | Bob     |
|Alice | John    |
|Alice | Mike    |
|Bob   | Tom     |
|Bob   | George  |
|George| Alice   |
|George| Bob     |
+------+---------+

Result:

+------+------------------+
|Sender|Recepients        |
+------+------------------+
|Alice |[Bob, John, Mike] |
|Bob   |[Tom, George]     |
|George|[Alice, Bob]      |
+------+------------------+

I tried df.groupBy("Sender").sum("Recepients") to get a string that I could then split, but it raised the error Aggregation function can only be applied on a numeric column.

All you need to do is groupBy the Sender column and collect the Recepient values into a list for each Sender.

The complete solution is below:

# create a DataFrame with the sample data
# (the schema uses "Sender" to match the output shown below)
df = spark.createDataFrame(
    [
        ("Alice", "Bob"),
        ("Alice", "John"),
        ("Alice", "Mike"),
        ("Bob", "Tom"),
        ("Bob", "George"),
        ("George", "Alice"),
        ("George", "Bob"),
    ],
    ("Sender", "Recepient"),
)

df.show()

# results below
+------+---------+
|Sender|Recepient|
+------+---------+
|Alice | Bob     |
|Alice | John    |
|Alice | Mike    |
|Bob   | Tom     |
|Bob   | George  |
|George| Alice   |
|George| Bob     |
+------+---------+

# Import functions
import pyspark.sql.functions as f

# perform a groupBy and use collect_list
df1 = df.groupby("Sender").agg(f.collect_list('Recepient').alias('Recepients'))
df1.show()
# results
+------+------------------+
|Sender|Recepients        |
+------+------------------+
|Alice |[Bob, John, Mike] |
|Bob   |[Tom, George]     |
|George|[Alice, Bob]      |
+------+------------------+
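As an optional variant (not part of the original answer): collect_list keeps duplicates and makes no ordering guarantee, so if you need a deduplicated, deterministically ordered array, collect_set plus array_sort (available since Spark 2.4) is one option, and concat_ws turns the array into a plain string. A minimal sketch, assuming the same df as above:

# Variant (assumption, not from the original answer):
# collect_set drops duplicate recipients, and array_sort gives the
# array a deterministic order, since collect_list/collect_set alone
# make no ordering guarantee.
import pyspark.sql.functions as f

df2 = df.groupBy("Sender").agg(
    f.array_sort(f.collect_set("Recepient")).alias("Recepients")
)
df2.show(truncate=False)

# concat_ws joins the array into a single comma-separated string
df3 = df2.withColumn("RecepientsStr", f.concat_ws(", ", "Recepients"))
df3.show(truncate=False)

Passing truncate=False to show() prints the full array values instead of cutting each cell at 20 characters.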
