
How to operate reduceByKey on a reduceByKey result

I am trying to run a reduceByKey on the result of a reduceByKey. The goal is to see whether we have a long-tail effect per year. Here, "long tail" means that for each year (separately) I want to check whether 65% or more of that year's sales come from 20% or fewer of the products.

Here is my dataset: a dataset with Year and asin (the product ID).


I want to first reduce by year, and then (for each year separately) reduce by asin, so that I get the number of occurrences of each product per year.

I tried this:

data_rdd.map(lambda x: (x.Year,(x.asin,1))).groupByKey().mapValues(list).sortBy(lambda x: x[0]).map(lambda x: x[1])

But I don't understand how to apply a reduceByKey to each row.

Thanks

In this case, I would use the SparkSQL API, because window functions will be very useful here. For each year, let's compute the percentage of products needed to reach at least 65% of the sales:

# let's create some sample data. I assume we have one line per sale
df = spark.createDataFrame([('2020', 'prod1'), ('2020', 'prod1'), ('2020', 'prod1'),
    ('2020', 'prod1'), ('2020', 'prod1'), ('2020', 'prod3'), ('2020', 'prod1'),
    ('2020', 'prod2'),  ('2020', 'prod2'),  ('2020', 'prod3'),  ('2020', 'prod4'),
    ('2020', 'prod5')], ['year', 'asin'])

# let's start by counting the number of sales per product, per year
df.groupBy("year", "asin").count().show()
+----+-----+-----+
|year| asin|count|
+----+-----+-----+
|2020|prod1|    6|
|2020|prod3|    2|
|2020|prod2|    2|
|2020|prod4|    1|
|2020|prod5|    1|
+----+-----+-----+

Now, let's use window functions to compute what we need to answer your question:

  • total number of products per year: product_count
  • total number of sales per year: total_sales
  • cumulated sales, starting from the most sold product: cum_sales
  • product index, starting from the most sold product: product_index

From there, product_per is the percentage of products and sales_per the percentage of sales, so that we can check whether at least 65% of the sales were made by less than 20% of the products. We can finally compute dist, the distance between the sales percentage and 65%. We will use that column to keep the first row that reaches more than 65% of the sales.

from pyspark.sql import Window
from pyspark.sql import functions as f

ordered_window = Window.partitionBy("year").orderBy(f.col("count").desc(), "asin")
window = Window.partitionBy("year")

rich_df = df.groupBy("year", "asin").count()\
    .withColumn("product_count", f.count(f.col("*")).over(window))\
    .withColumn("total_sales", f.sum("count").over(window))\
    .withColumn("cum_sales", f.sum("count").over(ordered_window))\
    .withColumn("product_index", f.rank().over(ordered_window))\
    .withColumn("product_per", f.col("product_index") / f.col("product_count"))\
    .withColumn("sales_per", f.col("cum_sales") / f.col("total_sales"))\
    .withColumn("dist", f.col("sales_per") - 0.65) 
rich_df.show()
+----+-----+-----+-------------+-----------+---------+-------------+-----------+------------------+--------------------+
|year| asin|count|product_count|total_sales|cum_sales|product_index|product_per|         sales_per|                dist|
+----+-----+-----+-------------+-----------+---------+-------------+-----------+------------------+--------------------+
|2020|prod1|    6|            5|         12|        6|            1|        0.2|               0.5|-0.15000000000000002|
|2020|prod2|    2|            5|         12|        8|            2|        0.4|0.6666666666666666|0.016666666666666607|
|2020|prod3|    2|            5|         12|       10|            3|        0.6|0.8333333333333334| 0.18333333333333335|
|2020|prod4|    1|            5|         12|       11|            4|        0.8|0.9166666666666666|  0.2666666666666666|
|2020|prod5|    1|            5|         12|       12|            5|        1.0|               1.0|                0.35|
+----+-----+-----+-------------+-----------+---------+-------------+-----------+------------------+--------------------+

So in this case, we need 40% of the products (2 out of 5) to reach at least 65% of the sales. Let's keep only that row:

dist_win = Window.partitionBy("year").orderBy("dist")
rich_df.where(f.col("dist") >= 0)\
    .withColumn("dist_rank", f.rank().over(dist_win))\
    .where(f.col("dist_rank") == 1)\
    .select("year", "product_per", "sales_per", (f.col("product_per") < 0.2).alias("hasLongTail"))\
    .show()
+----+-----------+------------------+-----------+
|year|product_per|         sales_per|hasLongTail|
+----+-----------+------------------+-----------+
|2020|        0.4|0.6666666666666666|      false|
+----+-----------+------------------+-----------+

This will also work with more than one year ;-)
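To illustrate, here is a minimal sketch reusing the df, window, ordered_window and dist_win definitions above, with made-up 2021 rows: since every window is partitioned by year, adding a second year to the input simply yields one result row per year.

# Made-up 2021 sample appended to the 2020 data; the pipeline itself is unchanged.
df_2021 = spark.createDataFrame(
    [('2021', 'prodA')] * 9 + [('2021', 'prodB'), ('2021', 'prodC')],
    ['year', 'asin'])

rich_multi = df.union(df_2021).groupBy("year", "asin").count()\
    .withColumn("product_count", f.count(f.col("*")).over(window))\
    .withColumn("total_sales", f.sum("count").over(window))\
    .withColumn("cum_sales", f.sum("count").over(ordered_window))\
    .withColumn("product_index", f.rank().over(ordered_window))\
    .withColumn("product_per", f.col("product_index") / f.col("product_count"))\
    .withColumn("sales_per", f.col("cum_sales") / f.col("total_sales"))\
    .withColumn("dist", f.col("sales_per") - 0.65)

# One output row per year: 2020 is unchanged, and 2021 reaches 65% of its
# sales with 1 product out of 3 (9 of its 11 sales come from prodA).
rich_multi.where(f.col("dist") >= 0)\
    .withColumn("dist_rank", f.rank().over(dist_win))\
    .where(f.col("dist_rank") == 1)\
    .select("year", "product_per", "sales_per", (f.col("product_per") < 0.2).alias("hasLongTail"))\
    .show()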

If you want to use RDDs and you do not have millions of distinct products, you can combine a reduceByKey, to count the sales of each product per year, with a groupByKey, to build the list of per-product sales counts for each year. Then you can use plain Python code to compute what you want:

from itertools import accumulate

# this function basically computes the cumulated sum of sales counts
# then, we find the number of products needed to achieve at least 65% of the sales
def percentageOfProducts(product_sales, sales_per=0.65):
    number_of_products = len(product_sales)
    number_of_sales = sum(product_sales)
    cumulated_sales = accumulate(sorted(product_sales, reverse=True))
    index = next(s[0] for s in enumerate(cumulated_sales) if s[1] / number_of_sales >= sales_per)
    return (index + 1) / number_of_products
 
result = data_rdd\
    .map(lambda x: ((x.year, x.asin),1))\
    .reduceByKey(lambda a, b : a+b)\
    .map(lambda x: (x[0][0], x[1]))\
    .groupByKey()\
    .mapValues(percentageOfProducts)
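
As a quick check (a sketch, assuming data_rdd has year and asin fields as in the question), the resulting (year, percentage) pairs can be collected and turned into the same long-tail flag as in the SQL version:

# result is an RDD of (year, fraction of products needed to reach 65% of sales)
for year, product_per in sorted(result.collect()):
    print(year, product_per, product_per < 0.2)  # True means a long tail for that year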
