Calculate the average while ignoring the 0 values in a column
Input:
item   loc    month  year  qty
watch  delhi  1      2020  10
watch  delhi  2      2020  0
watch  delhi  3      2020  20
watch  delhi  4      2020  30
watch  delhi  5      2020  40
watch  delhi  6      2020  50
Output:
item   loc    month  year  qty  avg
watch  delhi  1      2020  10   0
watch  delhi  2      2020  0    10
watch  delhi  3      2020  20   10
watch  delhi  4      2020  30   20
watch  delhi  5      2020  40   25
watch  delhi  6      2020  50   35
We need to calculate the average qty of the previous two months, with one condition: rows where qty = 0 must be excluded from the average.
For example, for month 3 the plain average would be (10 + 0) / 2 = 5, but since qty = 0 must be ignored, the average for month 3 is 10 / 1 = 10.
Thanks in advance.
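To make the rule concrete, here is a minimal plain-Scala sketch (no Spark involved) that applies it to the sample data. The object and function names are hypothetical, and it assumes a single item/loc group with months given as integers.

```scala
// Hypothetical helper, not part of any library: for each row, average the
// non-zero qty values of the two preceding months; 0.0 when none qualify.
object PrevTwoMonthAvg {
  def prevTwoNonZeroAvg(rows: Seq[(Int, Int)]): Seq[Double] =
    rows.map { case (month, _) =>
      // qty values from months (month - 2) and (month - 1), with zeros dropped
      val window = rows.collect {
        case (m, q) if m >= month - 2 && m <= month - 1 && q != 0 => q
      }
      if (window.isEmpty) 0.0 else window.sum.toDouble / window.size
    }

  def main(args: Array[String]): Unit = {
    // (month, qty) pairs from the sample input
    val sample = Seq(1 -> 10, 2 -> 0, 3 -> 20, 4 -> 30, 5 -> 40, 6 -> 50)
    println(prevTwoNonZeroAvg(sample)) // List(0.0, 10.0, 10.0, 20.0, 25.0, 35.0)
  }
}
```

The result matches the avg column in the expected output, including 0 for month 1, where no earlier month exists.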
In SQL, you can use window functions with a window frame specifier:
select t.*,
       coalesce(avg(nullif(qty, 0)) over (partition by item, loc
                                          order by month
                                          rows between 2 preceding and 1 preceding
                                         ),
                0) as qty_avg
from t;
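In Spark:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("item", "loc").orderBy("month").rangeBetween(-2, -1)
df.withColumn("month", col("month").cast("int"))
  .withColumn("avg", coalesce(avg(when(col("qty") =!= 0, col("qty"))).over(w), lit(0))).show()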
+-----+-----+-----+----+---+----+
| item|  loc|month|year|qty| avg|
+-----+-----+-----+----+---+----+
|watch|delhi|    1|2020| 10| 0.0|
|watch|delhi|    2|2020|  0|10.0|
|watch|delhi|    3|2020| 20|10.0|
|watch|delhi|    4|2020| 30|20.0|
|watch|delhi|    5|2020| 40|25.0|
|watch|delhi|    6|2020| 50|35.0|
+-----+-----+-----+----+---+----+
It can also be done in Spark using the lag function over a window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// Assumed window (w is not defined in the original snippet): no explicit frame, since lag is used.
val w = Window.partitionBy("item", "loc").orderBy("month")
val prev1 = lag("qty", 1, 0).over(w) // qty one row back, 0 if there is none
val prev2 = lag("qty", 2, 0).over(w) // qty two rows back, 0 if there is none

df.withColumn("month", col("month").cast(IntegerType))
  .withColumn("avg",
    when(prev2 =!= lit(0) && prev1 =!= lit(0), (prev2 + prev1).divide(lit(2))) // both non-zero: plain mean
      .when(prev1 =!= lit(0), prev1)                                           // only the previous row counts
      .otherwise(prev2))                                                       // prev2, which may itself be 0
  .show()
Output:
+-----+-----+-----+----+---+----+
| item|  loc|month|year|qty| avg|
+-----+-----+-----+----+---+----+
|watch|delhi|    1|2020| 10|   0|
|watch|delhi|    2|2020|  0|  10|
|watch|delhi|    3|2020| 20|  10|
|watch|delhi|    4|2020| 30|  20|
|watch|delhi|    5|2020| 40|25.0|
|watch|delhi|    6|2020| 50|35.0|
+-----+-----+-----+----+---+----+
I think that's a conditional window average:
select
    t.*,
    coalesce(
        avg(nullif(qty, 0)) over(
            partition by item, loc
            order by month
            rows between 2 preceding and 1 preceding  -- only the two preceding months
        ),
        0
    ) qty_avg
from mytable t
nullif() yields null for 0 values, which avg() then ignores. I wrapped the entire window average with coalesce(), since you seem to want 0 when there are only null values.
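The null handling described above can be mimicked outside SQL. The following is only an illustrative sketch in plain Scala, using Option to stand in for SQL null; the names nullIfZero, avg and coalesceZero are hypothetical stand-ins, not real SQL or Spark APIs.

```scala
// Hypothetical stand-ins for the SQL functions, with Option modelling SQL null.
object NullSemantics {
  // nullif(qty, 0): a 0 becomes "null" (None)
  def nullIfZero(qty: Int): Option[Int] = if (qty == 0) None else Some(qty)

  // avg(...): nulls are skipped; an all-null (or empty) input yields null
  def avg(values: Seq[Option[Int]]): Option[Double] = {
    val present = values.flatten
    if (present.isEmpty) None else Some(present.sum.toDouble / present.size)
  }

  // coalesce(x, 0): replace null with 0
  def coalesceZero(x: Option[Double]): Double = x.getOrElse(0.0)

  def main(args: Array[String]): Unit = {
    println(coalesceZero(avg(Seq(10, 0).map(nullIfZero)))) // 10.0
    println(coalesceZero(avg(Seq(0, 0).map(nullIfZero))))  // 0.0
  }
}
```

The first call mirrors month 3 in the question (values 10 and 0 give 10), and the second shows why coalesce is needed when every value in the window is 0.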