How to operate reduceByKey on a reduceByKey result
I am trying to perform a reduceByKey on the result of a reduceByKey. The goal is to see whether we have a long-tail effect per year: here, "long tail" means that for each year (separately), 65% or more of that year's sales come from 20% or fewer of the products.
This is my dataset: a dataset of Year and asin (its ID).
I want to first reduce by year, and then, within each year (separately), reduce by asin, so that for each year I get the number of times each product appears.
I tried this:
data_rdd.map(lambda x: (x.Year,(x.asin,1))).groupByKey().mapValues(list).sortBy(lambda x: x[0]).map(lambda x: x[1])
But I do not understand how to apply reduceByKey to each row.
Thanks
In this case, I would use the SparkSQL API, because window functions will be very useful here. For each year, let's compute the percentage of products needed to reach at least 65% of the sales:
# let's create some sample data. I assume we have one line per sale
df = spark.createDataFrame([('2020', 'prod1'), ('2020', 'prod1'), ('2020', 'prod1'),
('2020', 'prod1'), ('2020', 'prod1'), ('2020', 'prod3'), ('2020', 'prod1'),
('2020', 'prod2'), ('2020', 'prod2'), ('2020', 'prod3'), ('2020', 'prod4'),
('2020', 'prod5')], ['year', 'asin'])
# let's start by counting the number of sales per product, per year
df.groupBy("year", "asin").count().show()
+----+-----+-----+
|year| asin|count|
+----+-----+-----+
|2020|prod1| 6|
|2020|prod3| 2|
|2020|prod2| 2|
|2020|prod4| 1|
|2020|prod5| 1|
+----+-----+-----+
Now, let's use windows to compute what we need to answer your question:
product_count: the number of distinct products sold that year
total_sales: the total number of sales that year
cum_sales: the cumulative number of sales, with products ordered by sales count in descending order
product_index: the rank of each product in that ordering
From there, product_per is the percentage of products and sales_per the percentage of sales, so that we can check whether at least 65% of the sales were made by less than 20% of the products. We can finally compute dist, the distance between the sales percentage and 65%. We will use that column to keep the first row that reaches more than 65% of the sales.
from pyspark.sql import Window
from pyspark.sql import functions as f
ordered_window = Window.partitionBy("year").orderBy(f.col("count").desc(), "asin")
window = Window.partitionBy("year")
rich_df = df.groupBy("year", "asin").count()\
.withColumn("product_count", f.count(f.col("*")).over(window))\
.withColumn("total_sales", f.sum("count").over(window))\
.withColumn("cum_sales", f.sum("count").over(ordered_window))\
.withColumn("product_index", f.rank().over(ordered_window))\
.withColumn("product_per", f.col("product_index") / f.col("product_count"))\
.withColumn("sales_per", f.col("cum_sales") / f.col("total_sales"))\
.withColumn("dist", f.col("sales_per") - 0.65)
rich_df.show()
+----+-----+-----+-------------+-----------+---------+-------------+-----------+------------------+--------------------+
|year| asin|count|product_count|total_sales|cum_sales|product_index|product_per| sales_per| dist|
+----+-----+-----+-------------+-----------+---------+-------------+-----------+------------------+--------------------+
|2020|prod1| 6| 5| 12| 6| 1| 0.2| 0.5|-0.15000000000000002|
|2020|prod2| 2| 5| 12| 8| 2| 0.4|0.6666666666666666|0.016666666666666607|
|2020|prod3| 2| 5| 12| 10| 3| 0.6|0.8333333333333334| 0.18333333333333335|
|2020|prod4| 1| 5| 12| 11| 4| 0.8|0.9166666666666666| 0.2666666666666666|
|2020|prod5| 1| 5| 12| 12| 5| 1.0| 1.0| 0.35|
+----+-----+-----+-------------+-----------+---------+-------------+-----------+------------------+--------------------+
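To see concretely what these window columns compute, here is a minimal pure-Python sketch for the 2020 sample, with the per-product counts hardcoded in the same descending order the window uses (an assumption matching the table above):

```python
# Per-product sales counts for 2020, ordered as in the window (assumption)
counts = [6, 2, 2, 1, 1]
product_count = len(counts)   # 5 products that year
total_sales = sum(counts)     # 12 sales that year

rows = []
cum = 0
for index, c in enumerate(counts, start=1):
    cum += c  # cum_sales: running total over the ordered window
    rows.append({
        "product_per": index / product_count,   # share of products seen so far
        "sales_per": cum / total_sales,         # share of sales covered so far
        "dist": cum / total_sales - 0.65,       # distance to the 65% threshold
    })
```

The second row is the first one whose dist is non-negative, matching the prod2 line in the table.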
So in this case, we need 40% of the products (2 out of 5) to reach at least 65% of the sales. Let's keep only that row:
dist_win = Window.partitionBy("year").orderBy("dist")
rich_df.where(f.col("dist") >= 0)\
.withColumn("dist_rank", f.rank().over(dist_win))\
.where(f.col("dist_rank") == 1)\
.select("year", "product_per", "sales_per", (f.col("product_per") < 0.2).alias("hasLongTail"))\
.show()
+----+-----------+------------------+-----------+
|year|product_per| sales_per|hasLongTail|
+----+-----------+------------------+-----------+
|2020| 0.4|0.6666666666666666| false|
+----+-----------+------------------+-----------+
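The same selection can be checked in plain Python, taking the sales_per and product_per values from the table further above as given:

```python
# sales_per and product_per values from the 2020 example (hardcoded assumption)
sales_per = [6/12, 8/12, 10/12, 11/12, 12/12]
product_per = [0.2, 0.4, 0.6, 0.8, 1.0]

# keep the first row whose cumulative sales share reaches 65% (dist >= 0)
first = next(i for i, s in enumerate(sales_per) if s - 0.65 >= 0)
has_long_tail = product_per[first] < 0.2  # long tail if < 20% of products suffice
```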
This will work with more than one year as well ;-)
If you want to use RDDs and you do not have millions of distinct products, you can combine reduceByKey (to count the sales per product, per year) with groupByKey (to build the list of sales counts per year). Then you can use plain Python code to compute what you want:
# this function basically computes the cumulated sum of sales counts
# then, we find the number of products needed to achieve more than 65% of the sales
from itertools import accumulate

def percentageOfProducts(product_sales, sales_per=0.65):
number_of_products = len(product_sales)
number_of_sales = sum(product_sales)
cumulated_sales = accumulate(sorted(product_sales, reverse=True))
index = next(s[0] for s in enumerate(cumulated_sales) if s[1] / number_of_sales >= sales_per)
return (index + 1) / number_of_products
result = data_rdd\
.map(lambda x: ((x.year, x.asin),1))\
.reduceByKey(lambda a, b : a+b)\
.map(lambda x: (x[0][0], x[1]))\
.groupByKey()\
.mapValues(percentageOfProducts)
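To sanity-check percentageOfProducts without a Spark session, here is a standalone pure-Python run on the per-product counts of the 2020 sample above (the helper is reproduced so the sketch runs on its own):

```python
from itertools import accumulate

# Same helper as in the answer, reproduced so this sketch is self-contained.
def percentageOfProducts(product_sales, sales_per=0.65):
    number_of_products = len(product_sales)
    number_of_sales = sum(product_sales)
    cumulated_sales = accumulate(sorted(product_sales, reverse=True))
    index = next(s[0] for s in enumerate(cumulated_sales)
                 if s[1] / number_of_sales >= sales_per)
    return (index + 1) / number_of_products

# Per-product sales counts for the 2020 sample: 2 of 5 products (40%)
# are needed to cover at least 65% of the 12 sales.
share = percentageOfProducts([6, 2, 2, 1, 1])  # → 0.4
```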