PySpark: how to return the average of a column based on the value of another column?
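I don't think this should be difficult, but I can't figure out how to take the average of a column in my Spark DataFrame.
The DataFrame looks like this: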
+-------+------------+--------+------------------+
|Private|Applications|Accepted|              Rate|
+-------+------------+--------+------------------+
|    Yes|         417|     349|0.8369304556354916|
|    Yes|        1899|    1720|0.9057398630858347|
|    Yes|        1732|    1425|0.8227482678983834|
|    Yes|         494|     313|0.6336032388663968|
|     No|        3540|    2001|0.5652542372881356|
|     No|        7313|    4664|0.6377683577191303|
|    Yes|         619|     516|0.8336025848142165|
|    Yes|         662|     513|0.7749244712990937|
|    Yes|         761|     725|0.9526938239159002|
|    Yes|        1690|    1366| 0.808284023668639|
|    Yes|        6075|    5349|0.8804938271604938|
|    Yes|         632|     494|0.7816455696202531|
|     No|        1208|     877|0.7259933774834437|
|    Yes|       20192|   13007|0.6441660063391442|
|    Yes|        1436|    1228|0.8551532033426184|
|    Yes|         392|     351|0.8954081632653061|
|    Yes|       12586|    3239|0.2573494358811378|
|    Yes|        1011|     604|0.5974282888229476|
|    Yes|         848|     587|0.6922169811320755|
|    Yes|        8728|    5201|0.5958982584784601|
+-------+------------+--------+------------------+
I want to return the average of the Rate column when Private equals "Yes". How can I do this?
Try:
df.filter(df['Private'] == 'Yes').agg({'Rate': 'avg'}).collect()[0]
Another version that does the same thing:
from pyspark.sql.functions import col, avg
df_avg = df.filter(df["Private"] == "Yes").agg(avg(col("Rate")))
df_avg.show()
This works in Scala; the PySpark code should be very similar.
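import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// Needed for toDF and the $"..." column syntax; assumes a SparkSession named `spark`, as in spark-shell.
import spark.implicits._

val df = List(
  ("yes", 10),
  ("yes", 30),
  ("No", 40)).toDF("private", "rate")

// Window over all rows that share the same value of "private".
val window = Window.partitionBy($"private")

// Add the per-group average, leaving it null for the "No" rows.
df.
  withColumn("avg",
    when($"private" === "No", lit(null)).
    otherwise(avg($"rate").over(window))
  ).
  show()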
Input DataFrame:
+-------+----+
|private|rate|
+-------+----+
|    yes|  10|
|    yes|  30|
|     No|  40|
+-------+----+
Output DataFrame:
+-------+----+----+
|private|rate| avg|
+-------+----+----+
|     No|  40|null|
|    yes|  10|20.0|
|    yes|  30|20.0|
+-------+----+----+
Try:
from pyspark.sql.functions import col, mean, lit
df.where(col("Private")==lit("Yes")).select(mean(col("Rate"))).collect()