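Count of values in a row in a Spark DataFrame using Scala
I have a DataFrame. It contains the sales figures of different items across different sales outlets. The DataFrame shown below only shows a few of the items in a few of the outlets. There is a benchmark of 100 units sold per day for each item. Each item that sold more than 100 units is marked "Yes", and each item that sold fewer than 100 is marked "No".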
val df1 = Seq(
("Mumbai", 90, 109, , 101, 78, ............., "No", "Yes", "Yes", "No", .....),
("Singapore", 149, 129, , 201, 107, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Hawaii", 127, 101, , 98, 109, ............., "Yes", "Yes", "No", "Yes", .....),
("New York", 146, 130, , 173, 117, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Los Angeles", 94, 99, , 95, 113, ............., "No", "No", "No", "Yes", .....),
("Dubai", 201, 229, , 265, 317, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Bangalore", 56, 89, , 61, 77, ............., "No", "No", "No", "No", .....))
.toDF("Outlet","Boys_Toys","Girls_Toys","Men_Shoes","Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", .....)
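Now I want to add a column "Count_of_Yes" where, for each sales outlet (each row), the value of "Count_of_Yes" is the total number of "Yes" values in that row. How do I iterate over each row to get the count of Yes?
My expected DataFrame should be: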
val output_df = Seq(
("Mumbai", 90, 109, , 101, 78, ............., "No", "Yes", "Yes", "No", ....., 2),
("Singapore", 149, 129, , 201, 107, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Hawaii", 127, 101, , 98, 109, ............., "Yes", "Yes", "No", "Yes", ....., 3),
("New York", 146, 130, , 173, 117, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Los Angeles", 94, 99, , 95, 113, ............., "No", "No", "No", "Yes", ....., 1),
("Dubai", 201, 229, , 265, 317, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Bangalore", 56, 89, , 61, 77, ............., "No", "No", "No", "No", ....., 0))
.toDF("Outlet","Boys_Toys","Girls_Toys","Men_Shoes","Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", ....., "Count_of_Yes")
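You can convert the selected list of columns into an Array of 1s (for "Yes") and 0s (for "No"), then sum the array elements with aggregate in a SQL expression using selectExpr (the aggregate higher-order function requires Spark 2.4 or later), as shown below: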
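import org.apache.spark.sql.functions._
import spark.implicits._  // already in scope in spark-shell

val df = Seq(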
(1, 120, 80, 150, "Y", "N", "Y"),
(2, 50, 90, 110, "N", "N", "Y"),
(3, 70, 160, 90, "N", "Y", "N")
).toDF("id", "qty_a", "qty_b", "qty_c", "over100_a", "over100_b", "over100_c")
val cols = df.columns.filter(_.startsWith("over100_"))
df.
withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*)).
selectExpr("*", "aggregate(arr, 0, (acc, x) -> acc + x) as yes_count").
show
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
// | id|qty_a|qty_b|qty_c|over100_a|over100_b|over100_c|      arr|yes_count|
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
// |  1|  120|   80|  150|        Y|        N|        Y|[1, 0, 1]|        2|
// |  2|   50|   90|  110|        N|        N|        Y|[0, 0, 1]|        1|
// |  3|   70|  160|   90|        N|        Y|        N|[0, 1, 0]|        1|
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
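For the column names used in the question, the same idea might look roughly like the sketch below. It assumes that every flag column name ends in ">100" and holds the strings "Yes" or "No". The three-row, two-item sales DataFrame is a made-up subset for illustration only, and the local SparkSession is there just so the snippet stands on its own as a script or in spark-shell. The helper arr column is dropped at the end so that only Count_of_Yes is added.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, expr, when}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("count-yes").getOrCreate()
import spark.implicits._

// Hypothetical subset of the question's data (two items, three outlets).
val sales = Seq(
  ("Mumbai", 90, 109, "No", "Yes"),
  ("Singapore", 149, 129, "Yes", "Yes"),
  ("Bangalore", 56, 89, "No", "No")
).toDF("Outlet", "Boys_Toys", "Girls_Toys", "BT>100", "GT>100")

// Assumption: the flag columns are exactly those whose names end with ">100".
val flagCols = sales.columns.filter(_.endsWith(">100"))

val withCount = sales
  // Turn each flag into 1 ("Yes") or 0 (anything else, including null).
  .withColumn("arr", array(flagCols.map(c => when(col(c) === "Yes", 1).otherwise(0)): _*))
  // Sum the array elements with the SQL higher-order function aggregate.
  .withColumn("Count_of_Yes", expr("aggregate(arr, 0, (acc, x) -> acc + x)"))
  // Remove the helper column so only the count is added.
  .drop("arr")

withCount.show()
```

Selecting the flag columns by a naming pattern keeps the code independent of how many items there are, which matters here because the question elides most of the columns.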
Alternatively, use explode and groupBy/agg to sum the Array elements:
df.
withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*)).
withColumn("flattened", explode($"arr")).
groupBy("id").agg(sum($"flattened").as("yes_count"))
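Note that the groupBy/agg result contains only id and yes_count, and that it relies on id being unique per row. If the original columns are needed as well, one option is to join the counts back onto the source DataFrame. The following is a minimal sketch of that step, not part of the original answer; the names counts and joined are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode, sum, when}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("explode-join").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 120, 80, 150, "Y", "N", "Y"),
  (2, 50, 90, 110, "N", "N", "Y"),
  (3, 70, 160, 90, "N", "Y", "N")
).toDF("id", "qty_a", "qty_b", "qty_c", "over100_a", "over100_b", "over100_c")
val cols = df.columns.filter(_.startsWith("over100_"))

// One row per id with the number of "Y" flags; sum over ints yields a bigint column.
val counts = df
  .withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*))
  .withColumn("flattened", explode($"arr"))
  .groupBy("id")
  .agg(sum($"flattened").as("yes_count"))

// Assumption: id uniquely identifies a row, so the join does not duplicate records.
val joined = df.join(counts, Seq("id"))

joined.orderBy("id").show()
```

Because this variant adds a shuffle for the groupBy and another step for the join, the aggregate version above is likely the cheaper choice when it is available.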
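"How do I iterate over each row to get the count of Yes?" You can use a map transformation to transform each record. So in your case df.map() should contain the code that counts the number of YES values and emits a new record with this additional column.
Pseudocode is as follows: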
df.map(/* count the number of YES values and append that count at the end of the record */)
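A runnable version of that idea could look something like the sketch below. Calling df.map with a function from Row to Row needs an explicit Encoder for Row, so to keep things short this sketch swaps in df.rdd.map followed by createDataFrame with an extended schema. That is a substitution for illustration, not necessarily what the answer had in mind. The sample data, the "Y"/"N" flag values, the over100_ prefix and the yes_count name are all assumptions borrowed from the other answer.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("map-count").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 120, 80, 150, "Y", "N", "Y"),
  (2, 50, 90, 110, "N", "N", "Y"),
  (3, 70, 160, 90, "N", "Y", "N")
).toDF("id", "qty_a", "qty_b", "qty_c", "over100_a", "over100_b", "over100_c")

// Assumption: flag columns share the "over100_" prefix.
val flagCols = df.columns.filter(_.startsWith("over100_"))

// Original schema plus one extra integer field for the count.
val newSchema = StructType(df.schema.fields :+ StructField("yes_count", IntegerType, nullable = false))

// For each record, count the "Y" flags and append the count as the last field.
val rows = df.rdd.map { row =>
  val yesCount = flagCols.count(c => row.getAs[String](c) == "Y")
  Row.fromSeq(row.toSeq :+ yesCount)
}

val result = spark.createDataFrame(rows, newSchema)
result.show()
```

Row-by-row code like this is opaque to Spark's optimizer, so the column-expression approaches in the other answer are usually the better fit unless the per-row logic cannot be written as column expressions.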