How to operate reduceByKey on a reduceByKey result
I am trying to perform a reduceByKey on the result of a reduceByKey. The goal is to see whether we have a long-tail effect per year: here, "long tail" means that for each year (separately), 65% or more of that year's sales come from 20% or fewer of the products.
This is my dataset: a dataset of Year and asin (its ID).
I want to first reduce by year, and then, within each year (separately), reduce by asin, so that for each year I get the number of times each product appears.
I tried this:
data_rdd.map(lambda x: (x.Year,(x.asin,1))).groupByKey().mapValues(list).sortBy(lambda x: x[0]).map(lambda x: x[1])
But I do not understand how to apply reduceByKey to each row.
Thanks
In this case, I would use the SparkSQL API, because window functions will be very useful here. For each year, let's compute the percentage of products needed to reach at least 65% of the sales:
# let's create some sample data. I assume we have one line per sale
df = spark.createDataFrame([('2020', 'prod1'), ('2020', 'prod1'), ('2020', 'prod1'),
('2020', 'prod1'), ('2020', 'prod1'), ('2020', 'prod3'), ('2020', 'prod1'),
('2020', 'prod2'), ('2020', 'prod2'), ('2020', 'prod3'), ('2020', 'prod4'),
('2020', 'prod5')], ['year', 'asin'])
# let's start by counting the number of sales per product, per year
df.groupBy("year", "asin").count().show()
+----+-----+-----+
|year| asin|count|
+----+-----+-----+
|2020|prod1| 6|
|2020|prod3| 2|
|2020|prod2| 2|
|2020|prod4| 1|
|2020|prod5| 1|
+----+-----+-----+
Now, let's use windows to compute what we need to answer your question:
product_count: the number of distinct products sold that year
total_sales: the total number of sales that year
cum_sales: the cumulative number of sales, with products ordered by sales count in descending order
product_index: the rank of each product in that ordering
From there, product_per is the percentage of products and sales_per the percentage of sales, so that we can check whether at least 65% of the sales were made by less than 20% of the products. We can finally compute dist, the distance between the sales percentage and 65%. We will use that column to keep the first row that reaches more than 65% of the sales.
from pyspark.sql import Window
from pyspark.sql import functions as f
ordered_window = Window.partitionBy("year").orderBy(f.col("count").desc(), "asin")
window = Window.partitionBy("year")
rich_df = df.groupBy("year", "asin").count()\
.withColumn("product_count", f.count(f.col("*")).over(window))\
.withColumn("total_sales", f.sum("count").over(window))\
.withColumn("cum_sales", f.sum("count").over(ordered_window))\
.withColumn("product_index", f.rank().over(ordered_window))\
.withColumn("product_per", f.col("product_index") / f.col("product_count"))\
.withColumn("sales_per", f.col("cum_sales") / f.col("total_sales"))\
.withColumn("dist", f.col("sales_per") - 0.65)
rich_df.show()
+----+-----+-----+-------------+-----------+---------+-------------+-----------+------------------+--------------------+
|year| asin|count|product_count|total_sales|cum_sales|product_index|product_per| sales_per| dist|
+----+-----+-----+-------------+-----------+---------+-------------+-----------+------------------+--------------------+
|2020|prod1| 6| 5| 12| 6| 1| 0.2| 0.5|-0.15000000000000002|
|2020|prod2| 2| 5| 12| 8| 2| 0.4|0.6666666666666666|0.016666666666666607|
|2020|prod3| 2| 5| 12| 10| 3| 0.6|0.8333333333333334| 0.18333333333333335|
|2020|prod4| 1| 5| 12| 11| 4| 0.8|0.9166666666666666| 0.2666666666666666|
|2020|prod5| 1| 5| 12| 12| 5| 1.0| 1.0| 0.35|
+----+-----+-----+-------------+-----------+---------+-------------+-----------+------------------+--------------------+
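To see concretely what these window columns compute, here is a minimal pure-Python sketch for the 2020 sample, with the per-product counts hardcoded in the same descending order the window uses (an assumption matching the table above):

```python
# Per-product sales counts for 2020, ordered as in the window (assumption)
counts = [6, 2, 2, 1, 1]
product_count = len(counts)   # 5 products that year
total_sales = sum(counts)     # 12 sales that year

rows = []
cum = 0
for index, c in enumerate(counts, start=1):
    cum += c  # cum_sales: running total over the ordered window
    rows.append({
        "product_per": index / product_count,   # share of products seen so far
        "sales_per": cum / total_sales,         # share of sales covered so far
        "dist": cum / total_sales - 0.65,       # distance to the 65% threshold
    })
```

The second row is the first one whose dist is non-negative, matching the prod2 line in the table.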
So in this case, we need 40% of the products (2 out of 5) to reach at least 65% of the sales. Let's keep only that row:
dist_win = Window.partitionBy("year").orderBy("dist")
rich_df.where(f.col("dist") >= 0)\
.withColumn("dist_rank", f.rank().over(dist_win))\
.where(f.col("dist_rank") == 1)\
.select("year", "product_per", "sales_per", (f.col("product_per") < 0.2).alias("hasLongTail"))\
.show()
+----+-----------+------------------+-----------+
|year|product_per| sales_per|hasLongTail|
+----+-----------+------------------+-----------+
|2020| 0.4|0.6666666666666666| false|
+----+-----------+------------------+-----------+
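The same selection can be checked in plain Python, taking the sales_per and product_per values from the table further above as given:

```python
# sales_per and product_per values from the 2020 example (hardcoded assumption)
sales_per = [6/12, 8/12, 10/12, 11/12, 12/12]
product_per = [0.2, 0.4, 0.6, 0.8, 1.0]

# keep the first row whose cumulative sales share reaches 65% (dist >= 0)
first = next(i for i, s in enumerate(sales_per) if s - 0.65 >= 0)
has_long_tail = product_per[first] < 0.2  # long tail if < 20% of products suffice
```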
This will work with more than one year as well ;-)
If you want to use RDDs and you do not have millions of distinct products, you can combine reduceByKey (to count the sales per product, per year) with groupByKey (to build the list of sales counts per year). Then you can use plain Python code to compute what you want:
# this function basically computes the cumulated sum of sales counts
# then, we find the number of products needed to achieve more than 65% of the sales
from itertools import accumulate

def percentageOfProducts(product_sales, sales_per=0.65):
number_of_products = len(product_sales)
number_of_sales = sum(product_sales)
cumulated_sales = accumulate(sorted(product_sales, reverse=True))
index = next(s[0] for s in enumerate(cumulated_sales) if s[1] / number_of_sales >= sales_per)
return (index + 1) / number_of_products
result = data_rdd\
.map(lambda x: ((x.year, x.asin),1))\
.reduceByKey(lambda a, b : a+b)\
.map(lambda x: (x[0][0], x[1]))\
.groupByKey()\
.mapValues(percentageOfProducts)
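To sanity-check percentageOfProducts without a Spark session, here is a standalone pure-Python run on the per-product counts of the 2020 sample above (the helper is reproduced so the sketch runs on its own):

```python
from itertools import accumulate

# Same helper as in the answer, reproduced so this sketch is self-contained.
def percentageOfProducts(product_sales, sales_per=0.65):
    number_of_products = len(product_sales)
    number_of_sales = sum(product_sales)
    cumulated_sales = accumulate(sorted(product_sales, reverse=True))
    index = next(s[0] for s in enumerate(cumulated_sales)
                 if s[1] / number_of_sales >= sales_per)
    return (index + 1) / number_of_products

# Per-product sales counts for the 2020 sample: 2 of 5 products (40%)
# are needed to cover at least 65% of the 12 sales.
share = percentageOfProducts([6, 2, 2, 1, 1])  # → 0.4
```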