How do I count based on different row conditions in PySpark?
I have the following DataFrame:
| ID | Payment | Value | Date |
|---|---|---|---|
| 1 | Cash | 200 | 2020-01-01 |
| 1 | Credit Card | 500 | 2020-01-06 |
| 2 | Cash | 300 | 2020-02-01 |
| 3 | Credit Card | 400 | 2020-02-02 |
| 3 | Credit Card | 500 | 2020-01-03 |
| 3 | Cash | 200 | 2020-01-04 |
What I'd like to do is count how many IDs have used both Cash and Credit Card.
For example, in this case there would be 2 IDs that used both Cash and Credit Card.
How would I do that in PySpark?
You can use `collect_set` to count how many distinct payment methods each ID has used.
from pyspark.sql import functions as F
(df
.groupBy('ID')
.agg(F.collect_set('Payment').alias('methods'))
.withColumn('methods_size', F.size('methods'))
.show()
)
# +---+-------------------+------------+
# | ID| methods|methods_size|
# +---+-------------------+------------+
# | 1|[Credit Card, Cash]| 2|
# | 3|[Credit Card, Cash]| 2|
# | 2| [Cash]| 1|
# +---+-------------------+------------+
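To get the final number directly, you can filter the aggregated rows and count them. A minimal sketch continuing from the snippet above (`both_count` is a hypothetical name; using `array_contains` checks for each method explicitly, so the result stays correct even if other payment types appear in the data):

from pyspark.sql import functions as F

# Keep only IDs whose set of payment methods contains both values, then count them.
both_count = (df
    .groupBy('ID')
    .agg(F.collect_set('Payment').alias('methods'))
    .filter(F.array_contains('methods', 'Cash') &
            F.array_contains('methods', 'Credit Card'))
    .count()
)
print(both_count)
# 2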