Distribute the value of a Spark Dataframe row proportionately to other rows
When there is a Common value in the team column, I have to distribute that common value proportionally among the teams that took part in the same sale (id_sales).
+---------------+----------------+----------------+
|id_sales       |team            |price           |
+---------------+----------------+----------------+
|101            |Data Engineering|             200|
|102            |       Front-End|             300|
|103            |  Infrastructure|             100|
|103            |        Software|             200|
|103            |          Common|             800|
|104            |    Data Science|             500|
+---------------+----------------+----------------+
For example: in the table above there is a Common value within id_sales = 103, so I have to split that Common value across the other teams in the sale: Infrastructure has 100 and Software has 200, so Infrastructure receives 1/3 * 800 and Software receives 2/3 * 800. In the end my table would look like this:
+---------------+----------------+----------------+
|id_sales       |team            |price           |
+---------------+----------------+----------------+
|101            |Data Engineering|             200|
|102            |       Front-End|             300|
|103            |  Infrastructure|          366.67|
|103            |        Software|          733.33|
|104            |    Data Science|             500|
+---------------+----------------+----------------+
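The fractions in the example can be sanity-checked in plain Python (a standalone sketch of the arithmetic only, not Spark code):

```python
# Sanity check of the split for id_sales = 103: Infrastructure (100) and
# Software (200) hold 1/3 and 2/3 of the non-Common total (300), so the
# Common value of 800 is divided in those proportions.
prices = {"Infrastructure": 100, "Software": 200}
common = 800
total = sum(prices.values())  # 300

updated = {team: p + common * p / total for team, p in prices.items()}
print(updated)  # Infrastructure -> 366.67, Software -> 733.33 (rounded)
```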
Can anyone give me some ideas or hints? The tip can be in Python or Scala (Spark 2.4).
Code to create the table:
PySpark
spark_df = spark.createDataFrame(
    [
        ("101", "Data Engineering", "200"),
        ("102", "Front-End", "300"),
        ("103", "Infrastructure", "100"),
        ("103", "Software", "200"),
        ("103", "Common", "800"),
        ("104", "Data Science", "500")
    ],
    ["id_sales", "team", "price"])
Spark Scala
val spark_df = Seq(
("101", "Data Engineering", "200"),
("102", "Front-End", "300"),
("103", "Infrastructure", "100"),
("103", "Software", "200"),
("103", "Common", "800"),
("104", "Data Science", "500")
).toDF("id_sales", "team", "price")
Thanks :)
Try this:
scala> val df = Seq(
| ("101", "Data Engineering", "200"),
| ("102", "Front-End", "300"),
| ("103", "Infrastructure", "100"),
| ("103", "Software", "200"),
| ("103", "Common", "800"),
| ("104", "Data Science", "500")
| ).toDF("id_sales", "team", "price")
df: org.apache.spark.sql.DataFrame = [id_sales: string, team: string ... 1 more field]
scala> df.show
+--------+----------------+-----+
|id_sales| team|price|
+--------+----------------+-----+
| 101|Data Engineering| 200|
| 102| Front-End| 300|
| 103| Infrastructure| 100|
| 103| Software| 200|
| 103| Common| 800|
| 104| Data Science| 500|
+--------+----------------+-----+
scala> val commonDF = df.filter("team='Common'")
commonDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_sales: string, team: string ... 1 more field]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val ww = Window.partitionBy("id_sales")
ww: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@43324745
scala> val finalDF = df.as("main").filter("team<>'Common'").withColumn("weight",col("price")/sum("price").over(ww)).join(commonDF.as("common"), Seq("id_sales"),"left").withColumn("updated_price",when(col("common.price").isNull,df("price")).otherwise(df("price")+col("weight")*col("common.price"))).select($"id_sales",$"main.team",$"updated_price".as("price"))
finalDF: org.apache.spark.sql.DataFrame = [id_sales: string, team: string ... 1 more field]
scala> finalDF.show
+--------+----------------+------------------+
|id_sales| team| price|
+--------+----------------+------------------+
| 101|Data Engineering| 200|
| 104| Data Science| 500|
| 102| Front-End| 300|
| 103| Software| 733.3333333333333|
| 103| Infrastructure|366.66666666666663|
+--------+----------------+------------------+
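For the Python side of the question, the same distribute-and-drop logic can be sketched without Spark (a hypothetical stdlib-only helper, useful for checking the expected output on small data; a PySpark version would mirror the Scala window-and-join above):

```python
from collections import defaultdict

def distribute_common(rows, common_team="Common"):
    """Distribute each sale's Common price across the sale's other teams,
    proportionally to their share of the non-Common total, and drop the
    Common row itself."""
    common = {}                   # id_sales -> Common price
    totals = defaultdict(float)   # id_sales -> sum of non-Common prices
    for id_sales, team, price in rows:
        if team == common_team:
            common[id_sales] = float(price)
        else:
            totals[id_sales] += float(price)

    result = []
    for id_sales, team, price in rows:
        if team == common_team:
            continue  # the Common row is removed from the output
        price = float(price)
        extra = 0.0
        if id_sales in common:
            extra = common[id_sales] * price / totals[id_sales]
        result.append((id_sales, team, price + extra))
    return result

rows = [
    ("101", "Data Engineering", "200"),
    ("102", "Front-End", "300"),
    ("103", "Infrastructure", "100"),
    ("103", "Software", "200"),
    ("103", "Common", "800"),
    ("104", "Data Science", "500"),
]
for r in distribute_common(rows):
    print(r)
```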