Count of values in a row in Spark DataFrame using Scala
I have a dataframe. It contains the sales amounts for different items across different sales outlets. The dataframe shown below shows only a few of the items across a few sales outlets. There is a benchmark of 100 units sold per day for each item. Each item that sold more than 100 is marked as "Yes", and each item that sold fewer than 100 is marked as "No".
val df1 = Seq(
("Mumbai", 90, 109, 101, 78, ............., "No", "Yes", "Yes", "No", .....),
("Singapore", 149, 129, 201, 107, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Hawaii", 127, 101, 98, 109, ............., "Yes", "Yes", "No", "Yes", .....),
("New York", 146, 130, 173, 117, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Los Angeles", 94, 99, 95, 113, ............., "No", "No", "No", "Yes", .....),
("Dubai", 201, 229, 265, 317, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Bangalore", 56, 89, 61, 77, ............., "No", "No", "No", "No", .....))
.toDF("Outlet","Boys_Toys","Girls_Toys","Men_Shoes","Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", .....)
Now I want to add a column "Count_of_Yes" in which, for each sales outlet (each row), the value is the total number of "Yes" values in that row. How do I iterate over each row to get the count of Yes?
My expected dataframe should be:
val output_df = Seq(
("Mumbai", 90, 109, 101, 78, ............., "No", "Yes", "Yes", "No", ....., 2),
("Singapore", 149, 129, 201, 107, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Hawaii", 127, 101, 98, 109, ............., "Yes", "Yes", "No", "Yes", ....., 3),
("New York", 146, 130, 173, 117, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Los Angeles", 94, 99, 95, 113, ............., "No", "No", "No", "Yes", ....., 1),
("Dubai", 201, 229, 265, 317, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Bangalore", 56, 89, 61, 77, ............., "No", "No", "No", "No", ....., 0))
.toDF("Outlet","Boys_Toys","Girls_Toys","Men_Shoes","Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", ....., "Count_of_Yes")
You can convert the selected list of columns into an Array of 1s (for "yes") and 0s (for "no") and sum the array elements with aggregate in a SQL expression using selectExpr, as shown below:
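import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and $"..."; already in scope in spark-shell

val df = Seq(
(1, 120, 80, 150, "Y", "N", "Y"),
(2, 50, 90, 110, "N", "N", "Y"),
(3, 70, 160, 90, "N", "Y", "N")
).toDF("id", "qty_a", "qty_b", "qty_c", "over100_a", "over100_b", "over100_c")
// Select the flag columns by their common prefix
val cols = df.columns.filter(_.startsWith("over100_"))
// Build an array of 0/1 indicators, then fold it with the SQL aggregate function (Spark 2.4+)
df.
withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*)).
selectExpr("*", "aggregate(arr, 0, (acc, x) -> acc + x) as yes_count").
show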
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
// | id|qty_a|qty_b|qty_c|over100_a|over100_b|over100_c|      arr|yes_count|
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
// |  1|  120|   80|  150|        Y|        N|        Y|[1, 0, 1]|        2|
// |  2|   50|   90|  110|        N|        N|        Y|[0, 0, 1]|        1|
// |  3|   70|  160|   90|        N|        Y|        N|[0, 1, 0]|        1|
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
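If the intermediate arr column is not needed, the same per-row count can probably be obtained by adding the 0/1 indicator columns together directly. The following is a minimal sketch rather than the method given above. It assumes a local SparkSession (in spark-shell, spark already exists), the same over100_ column prefix, and at least one matching column, since reduce fails on an empty collection.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("yes-count").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 120, 80, 150, "Y", "N", "Y"),
  (2, 50, 90, 110, "N", "N", "Y"),
  (3, 70, 160, 90, "N", "Y", "N")
).toDF("id", "qty_a", "qty_b", "qty_c", "over100_a", "over100_b", "over100_c")

val cols = df.columns.filter(_.startsWith("over100_"))

// Map each flag column to a 0/1 Column, then add them into a single expression.
val yesCount = cols.map(c => when(col(c) === "Y", 1).otherwise(0)).reduce(_ + _)

// All original columns are kept; only yes_count is appended.
val result = df.withColumn("yes_count", yesCount)
result.show()
```

Because the sum is a plain column expression, it needs no array column and no higher-order SQL function.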
Alternatively, use explode and groupBy/agg to sum the Array elements:
df.
withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*)).
withColumn("flattened", explode($"arr")).
groupBy("id").agg(sum($"flattened").as("yes_count"))
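Note that the groupBy/agg result contains only id and yes_count. To keep the original columns, the counts can be joined back on the key. This is a minimal sketch, assuming that id uniquely identifies each row (as it does in the sample data) and that a local SparkSession is acceptable for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode, sum, when}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("yes-count-explode").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 120, 80, 150, "Y", "N", "Y"),
  (2, 50, 90, 110, "N", "N", "Y"),
  (3, 70, 160, 90, "N", "Y", "N")
).toDF("id", "qty_a", "qty_b", "qty_c", "over100_a", "over100_b", "over100_c")

val cols = df.columns.filter(_.startsWith("over100_"))

// One row per array element, then sum the 0/1 values per id.
val counts = df.
  withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*)).
  withColumn("flattened", explode($"arr")).
  groupBy("id").agg(sum($"flattened").as("yes_count"))

// Join the per-id counts back onto the original rows.
val joined = df.join(counts, Seq("id"))
joined.orderBy("id").show()
```

The explode and join steps add a shuffle, so this variant is likely to cost more than summing within the row.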
How do I iterate over each row to get the count of Yes? You can use a map transformation to transform each record. So in your case, df.map() should contain the code to count the number of "Yes" values and emit a new record that has this additional column.
Pseudocode as follows:
df.map(count the number of "Yes" values and append the count at the end of the record)
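The map-based idea above can be sketched as runnable code. This is one possible reading of the pseudocode, not a definitive implementation. It maps over the underlying RDD of Row objects and rebuilds the dataframe with an extended schema, which avoids having to supply an encoder for Row. The sample data is a trimmed subset of the question's rows, and the local SparkSession is an assumption for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("yes-count-map").getOrCreate()
import spark.implicits._

// Trimmed subset of the question's data: two item columns and their flags.
val df = Seq(
  ("Mumbai", 90, 109, "No", "Yes"),
  ("Dubai", 201, 229, "Yes", "Yes"),
  ("Bangalore", 56, 89, "No", "No")
).toDF("Outlet", "Boys_Toys", "Girls_Toys", "BT>100", "GT>100")

// Count the "Yes" values in each row and append the count as the last field.
// Caveat: this counts "Yes" in any column, not only in the flag columns.
val rowsWithCount = df.rdd.map { row =>
  val yesCount = row.toSeq.count(_ == "Yes")
  Row.fromSeq(row.toSeq :+ yesCount)
}

// Extend the original schema with the new integer column.
val newSchema = StructType(
  df.schema.fields :+ StructField("Count_of_Yes", IntegerType, nullable = false))

val output = spark.createDataFrame(rowsWithCount, newSchema)
output.show()
```

Going through Row objects bypasses Spark's column-level optimizations, so the column-expression approaches are generally preferable when they fit.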