
How do I count based on conditions across different rows in PySpark?

I have the following DataFrame:

ID  Payment      Value  Date
1   Cash         200    2020-01-01
1   Credit Card  500    2020-01-06
2   Cash         300    2020-02-01
3   Credit Card  400    2020-02-02
3   Credit Card  500    2020-01-03
3   Cash         200    2020-01-04

What I'd like to do is count how many IDs have used both Cash and Credit Card.

For example, in this case there would be 2 IDs (1 and 3) that used both Cash and Credit Card.

How would I do that in PySpark?

You can use collect_set to gather the distinct payment methods each ID has used, then check the size of that set.

from pyspark.sql import functions as F

(df
    .groupBy('ID')
    .agg(F.collect_set('Payment').alias('methods'))
    .withColumn('methods_size', F.size('methods'))
    .show()
)

# +---+-------------------+------------+
# | ID|            methods|methods_size|
# +---+-------------------+------------+
# |  1|[Credit Card, Cash]|           2|
# |  3|[Credit Card, Cash]|           2|
# |  2|             [Cash]|           1|
# +---+-------------------+------------+
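To get the number the question actually asks for, you can filter on that set and count the remaining IDs. A minimal sketch, assuming df is the DataFrame above and that "Cash" and "Credit Card" are the exact values in the Payment column:

from pyspark.sql import functions as F

both_count = (df
    .groupBy('ID')
    .agg(F.collect_set('Payment').alias('methods'))
    # keep only IDs whose set of methods contains both values
    .filter(F.array_contains('methods', 'Cash') &
            F.array_contains('methods', 'Credit Card'))
    .count()
)
print(both_count)
# 2

Filtering with array_contains on both values is slightly safer than checking size('methods') == 2, since it still works if other payment methods ever appear in the data.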
