
pySpark count IDs on condition

I have the following dataset and am working with PySpark:

df = sparkSession.createDataFrame([(5, 'Samsung', '2018-02-23'),
                                   (8, 'Apple', '2018-02-22'),
                                   (5, 'Sony', '2018-02-21'),
                                   (5, 'Samsung', '2018-02-20'),
                                   (8, 'LG', '2018-02-20')],
                                   ['ID', 'Product', 'Date']
                                  )

+---+-------+----------+
| ID|Product|      Date|
+---+-------+----------+
|  5|Samsung|2018-02-23|
|  8|  Apple|2018-02-22|
|  5|   Sony|2018-02-21|
|  5|Samsung|2018-02-20|
|  8|     LG|2018-02-20|
+---+-------+----------+
# Each ID will ALWAYS appear at least 2 times (do not consider the case of unique IDs in this df)

For each ID, only the product with the highest frequency should have its counter incremented. In case of equal frequency, the most recent date decides which product receives the +1.

From the sample above, the desired output would be:

+-------+-------+
|Product|Counter|
+-------+-------+
|Samsung|      1|
|  Apple|      1|
|   Sony|      0|
|     LG|      0|
+-------+-------+


# Samsung - 1 (preferred twice by ID=5)
# Apple - 1 (preferred by ID=8 more recently than LG)
# Sony - 0 (because ID=5 preferred Samsung 2 times, and Sony only 1)
# LG - 0 (because ID=8 preferred Apple more recently) 

What is the most efficient way to achieve this result with PySpark?

IIUC, you want to pick the most frequent product for each ID, breaking ties using the most recent Date.

So first, we can get the count for each product/ID pair using:

import pyspark.sql.functions as f
from pyspark.sql import Window

df = df.select(
    'ID',
    'Product',
    'Date', 
    f.count('Product').over(Window.partitionBy('ID', 'Product')).alias('count')
)
df.show()
#+---+-------+----------+-----+
#| ID|Product|      Date|count|
#+---+-------+----------+-----+
#|  5|   Sony|2018-02-21|    1|
#|  8|     LG|2018-02-20|    1|
#|  8|  Apple|2018-02-22|    1|
#|  5|Samsung|2018-02-23|    2|
#|  5|Samsung|2018-02-20|    2|
#+---+-------+----------+-----+

Now you can use a Window to rank each product for each ID. We can use pyspark.sql.functions.desc() to sort by count and Date descending. If the row_number() is equal to 1, that means that row is first.

w = Window.partitionBy('ID').orderBy(f.desc('count'), f.desc('Date'))
df = df.select(
    'Product',
    (f.row_number().over(w) == 1).cast("int").alias('Counter')
)
df.show()
#+-------+-------+
#|Product|Counter|
#+-------+-------+
#|Samsung|      1|
#|Samsung|      0|
#|   Sony|      0|
#|  Apple|      1|
#|     LG|      0|
#+-------+-------+

Finally, groupBy() the Product and take the maximum value of Counter:

df.groupBy('Product').agg(f.max('Counter').alias('Counter')).show()
#+-------+-------+
#|Product|Counter|
#+-------+-------+
#|   Sony|      0|
#|Samsung|      1|
#|     LG|      0|
#|  Apple|      1|
#+-------+-------+

Update

Here's a slightly simpler way:

w = Window.partitionBy('ID').orderBy(f.desc('count'), f.desc('Date'))
df.groupBy('ID', 'Product')\
    .agg(f.max('Date').alias('Date'), f.count('Product').alias('count'))\
    .select('Product', (f.row_number().over(w) == 1).cast("int").alias('Counter'))\
    .show()
#+-------+-------+
#|Product|Counter|
#+-------+-------+
#|Samsung|      1|
#|   Sony|      0|
#|  Apple|      1|
#|     LG|      0|
#+-------+-------+
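
If you'd rather avoid the window function, here is a minimal alternative sketch (not part of the original answer). It assumes df refers to the original 3-column DataFrame (as in the update above) and that Date stays an ISO 'yyyy-MM-dd' string, so lexicographic comparison matches chronological order; the intermediate names counts, winners and result are just illustrative. It picks each ID's winning product with a max over a struct (which compares fields left to right), flags that product with 1, and marks everything else 0:

counts = df.groupBy('ID', 'Product').agg(
    f.count('Product').alias('cnt'),    # frequency of the product for this ID
    f.max('Date').alias('max_date')     # most recent date, used as tie-breaker
)

# max over a struct compares fields in order: highest cnt wins,
# ties are broken by the latest max_date
winners = counts.groupBy('ID').agg(
    f.max(f.struct('cnt', 'max_date', 'Product')).alias('top')
).select('ID', f.col('top.Product').alias('Product'), f.lit(1).alias('Counter'))

# winners get Counter=1, every other ID/Product pair falls back to 0
result = (counts
          .join(winners, ['ID', 'Product'], 'left')
          .groupBy('Product')
          .agg(f.max(f.coalesce('Counter', f.lit(0))).alias('Counter')))
result.show()
# produces the same Product/Counter pairs as above (row order may vary)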
