I have the following DataFrame:
ID | Payment | Value | Date |
---|---|---|---|
1 | Cash | 200 | 2020-01-01 |
1 | Credit Card | 500 | 2020-01-06 |
2 | Cash | 300 | 2020-02-01 |
3 | Credit Card | 400 | 2020-02-02 |
3 | Credit Card | 500 | 2020-01-03 |
3 | Cash | 200 | 2020-01-04 |
What I'd like to do is count how many IDs have used both Cash and Credit Card.
For example, in this case there would be 2 IDs that used both Cash and Credit Card.
How would I do that in PySpark?
You can use `collect_set` to collect the distinct payment methods each ID has used, then check the size of that set.
from pyspark.sql import functions as F
(df
.groupBy('ID')
.agg(F.collect_set('Payment').alias('methods'))
.withColumn('methods_size', F.size('methods'))
.show()
)
# +---+-------------------+------------+
# | ID| methods|methods_size|
# +---+-------------------+------------+
# | 1|[Credit Card, Cash]| 2|
# | 3|[Credit Card, Cash]| 2|
# | 2| [Cash]| 1|
# +---+-------------------+------------+