Calculate the average while ignoring 0 values in a column

Question

Input:

item   loc    month   year     qty    
watch  delhi   1       2020     10    
watch  delhi   2       2020     0     
watch  delhi   3       2020     20    
watch  delhi   4       2020     30    
watch  delhi   5       2020     40    
watch  delhi   6       2020     50

Output:

item   loc    month   year     qty    avg
watch  delhi   1       2020     10    0
watch  delhi   2       2020     0     10
watch  delhi   3       2020     20    10
watch  delhi   4       2020     30    20
watch  delhi   5       2020     40    25
watch  delhi   6       2020     50    35

We need to calculate the average qty over the previous two months, with one condition: rows where qty = 0 must not be counted when computing the average.

For example, for month 3 the plain average would be (10 + 0) / 2 = 5. Because rows with qty = 0 are ignored, the average for month 3 is 10 / 1 = 10.

Thanks in advance.
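The rule described in the question can be sketched in plain Scala, without Spark, to make the expected avg column easy to check by hand. This is only an illustration under stated assumptions: `PrevAvgSketch` and `prevNonZeroAvg` are hypothetical names, and the rows are assumed to be consecutive months of a single item/loc group, already sorted by month.

```scala
object PrevAvgSketch {
  // Hypothetical helper, not from the original post: for each position, average the
  // non-zero values among the previous `window` entries, or 0.0 if there are none.
  def prevNonZeroAvg(qty: Seq[Int], window: Int = 2): Seq[Double] =
    qty.indices.map { i =>
      // Previous `window` entries, excluding the current one, with zeros dropped.
      val prev = qty.slice(math.max(0, i - window), i).filter(_ != 0)
      if (prev.isEmpty) 0.0 else prev.sum.toDouble / prev.size
    }

  def main(args: Array[String]): Unit = {
    // qty column from the question's input, months 1 to 6.
    val qty = Seq(10, 0, 20, 30, 40, 50)
    println(prevNonZeroAvg(qty))
  }
}
```

For the sample input this yields 0, 10, 10, 20, 25, 35, which matches the avg column in the expected output.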

Answer 1

In SQL, you can use window functions with a window frame specifier:

select t.*,
       coalesce(avg(nullif(qty, 0)) over (partition by item, loc
                                          order by month
                                          rows between 2 preceding and 1 preceding
                                         ),
                0) as qty_avg
from t;

Answer 2

In Spark:

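import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy("item", "loc").orderBy("month").rangeBetween(-2, -1)
// coalesce maps the null for month 1 (no previous rows) to 0
df.withColumn("month", col("month").cast("int"))
  .withColumn("avg", coalesce(avg(when(col("qty") =!= 0, col("qty"))).over(w), lit(0))).show()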

+-----+-----+-----+----+---+----+
| item|  loc|month|year|qty| avg|
+-----+-----+-----+----+---+----+
|watch|delhi|    1|2020| 10| 0.0|
|watch|delhi|    2|2020|  0|10.0|
|watch|delhi|    3|2020| 20|10.0|
|watch|delhi|    4|2020| 30|20.0|
|watch|delhi|    5|2020| 40|25.0|
|watch|delhi|    6|2020| 50|35.0|
+-----+-----+-----+----+---+----+

Answer 3

This can be done in Spark using the lag function over a window:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
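
// Assumption: the definition of w is missing from the original post.
// lag needs an ordered window without an explicit frame, so one is assumed here.
val w = Window.partitionBy("item", "loc").orderBy("month")
val prev1 = lag("qty", 1, 0).over(w) // qty one row back, 0 if there is none
val prev2 = lag("qty", 2, 0).over(w) // qty two rows back, 0 if there is none

df.withColumn("month", col("month").cast(IntegerType))
  .withColumn("avg",
    when(prev2 =!= lit(0) && prev1 =!= lit(0), (prev2 + prev1).divide(lit(2)))
      .when(prev1 =!= lit(0), prev1)
      .otherwise(prev2))
  .show()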

Output:

+-----+-----+-----+----+---+----+
| item|  loc|month|year|qty| avg|
+-----+-----+-----+----+---+----+
|watch|delhi|    1|2020| 10|   0|
|watch|delhi|    2|2020|  0|  10|
|watch|delhi|    3|2020| 20|  10|
|watch|delhi|    4|2020| 30|  20|
|watch|delhi|    5|2020| 40|25.0|
|watch|delhi|    6|2020| 50|35.0|
+-----+-----+-----+----+---+----+

Answer 4

I think that's a conditional window average:

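select
    t.*,
    coalesce(avg(nullif(qty, 0)) over(
        partition by item, loc order by month
        -- frame added: without it the default frame runs from the first row through the current row
        rows between 2 preceding and 1 preceding
    ), 0) qty_avg
from mytable t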

nullif() yields null for 0 values, which avg() then ignores. I wrapped the entire window average with coalesce(), since you seem to want 0 when there are only null values.
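To make the nullif() / avg() / coalesce() chain concrete, here is a small plain-Scala analogue in which `None` stands in for SQL NULL. It is a sketch only, not how any database implements these functions; `NullifAvgSketch`, `nullIfZero`, and `avgSkippingNulls` are hypothetical names introduced for illustration.

```scala
object NullifAvgSketch {
  // Analogue of nullif(qty, 0): a 0 becomes "no value".
  def nullIfZero(q: Int): Option[Int] = if (q == 0) None else Some(q)

  // Analogue of avg(...): missing values are skipped, and the result is
  // itself missing when nothing remains to average.
  def avgSkippingNulls(xs: Seq[Option[Int]]): Option[Double] = {
    val present = xs.flatten
    if (present.isEmpty) None else Some(present.sum.toDouble / present.size)
  }

  def main(args: Array[String]): Unit = {
    // Frame seen by month 3 in the question: the previous quantities 10 and 0.
    val frame = Seq(10, 0).map(nullIfZero)
    // getOrElse(0.0) plays the role of coalesce(..., 0).
    println(avgSkippingNulls(frame).getOrElse(0.0))     // 10.0
    // Frame seen by month 1: no previous rows at all.
    println(avgSkippingNulls(Seq.empty).getOrElse(0.0)) // 0.0
  }
}
```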

Answer 1
4 2020-08-31 10:45:10

Answer 2
1 accepted 2020-08-31 09:56:07

Answer 3
1 2020-08-31 15:49:37

Answer 4
0 2020-08-31 09:46:50