How to get a column with a list of values from another column in a PySpark DataFrame?
Can someone help me understand how to create a PySpark DataFrame with all of the recipients for each sender?
For example:
Input DataFrame:
+------+---------+
|Sender|Recepient|
+------+---------+
|Alice |Bob      |
|Alice |John     |
|Alice |Mike     |
|Bob   |Tom      |
|Bob   |George   |
|George|Alice    |
|George|Bob      |
+------+---------+
Result:
+------+-----------------+
|Sender|Recepients       |
+------+-----------------+
|Alice |[Bob, John, Mike]|
|Bob   |[Tom, George]    |
|George|[Alice, Bob]     |
+------+-----------------+
I tried df.groupBy("Sender").sum("Recepients") to get one string that I could then split, but it failed with the error Aggregation function can only be applied on a numeric column.
All you need to do is groupBy the Sender column and collect the Recepient values into a list.
The complete solution is below:
# create a SparkSession and a sample DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("Alice", "Bob"),
        ("Alice", "John"),
        ("Alice", "Mike"),
        ("Bob", "Tom"),
        ("Bob", "George"),
        ("George", "Alice"),
        ("George", "Bob"),
    ],
    ("Sender", "Recepient"),
)
df.show(truncate=False)
# results below
+------+---------+
|Sender|Recepient|
+------+---------+
|Alice |Bob      |
|Alice |John     |
|Alice |Mike     |
|Bob   |Tom      |
|Bob   |George   |
|George|Alice    |
|George|Bob      |
+------+---------+
# Import functions
import pyspark.sql.functions as f
# perform a groupBy and use collect_list
df1 = df.groupBy("Sender").agg(f.collect_list("Recepient").alias("Recepients"))
df1.show(truncate=False)
# results
+------+-----------------+
|Sender|Recepients       |
+------+-----------------+
|Alice |[Bob, John, Mike]|
|Bob   |[Tom, George]    |
|George|[Alice, Bob]     |
+------+-----------------+
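One caveat worth knowing: collect_list keeps duplicate values and does not guarantee any element order, since the order depends on how rows arrive after the shuffle. If you want each sender's recipients deduplicated and in a deterministic order, a minimal variation on the same df and f defined above, using the standard collect_set and sort_array functions, would be:
# variation: unique recipients in sorted order
# collect_set drops duplicates; sort_array makes the output deterministic
df2 = df.groupBy("Sender").agg(
    f.sort_array(f.collect_set("Recepient")).alias("Recepients")
)
df2.show(truncate=False)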