Count of values in a row in a Spark dataframe using Scala

Question

I have a dataframe that contains the sales figures for different items across different sales outlets. The dataframe below shows only a few of the items across a few of the outlets. There is a benchmark of 100 sales per day for each item. Each item that sold more than 100 is marked "Yes", and each item below 100 is marked "No".

val df1 = Seq(
("Mumbai", 90, 109, 101, 78, ............., "No", "Yes", "Yes", "No", .....),
("Singapore", 149, 129, 201, 107, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Hawaii", 127, 101, 98, 109, ............., "Yes", "Yes", "No", "Yes", .....),
("New York", 146, 130, 173, 117, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Los Angeles", 94, 99, 95, 113, ............., "No", "No", "No", "Yes", .....),
("Dubai", 201, 229, 265, 317, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Bangalore", 56, 89, 61, 77, ............., "No", "No", "No", "No", .....))
.toDF("Outlet","Boys_Toys","Girls_Toys","Men_Shoes","Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", .....)

Now I want to add a column "Count_of_Yes" whose value, for each sales outlet (each row), is the total number of "Yes" values in that row. How do I iterate over each row to get the count of "Yes"?

My expected dataframe is:

val output_df = Seq(
("Mumbai", 90, 109, 101, 78, ............., "No", "Yes", "Yes", "No", ....., 2),
("Singapore", 149, 129, 201, 107, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Hawaii", 127, 101, 98, 109, ............., "Yes", "Yes", "No", "Yes", ....., 3),
("New York", 146, 130, 173, 117, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Los Angeles", 94, 99, 95, 113, ............., "No", "No", "No", "Yes", ....., 1),
("Dubai", 201, 229, 265, 317, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Bangalore", 56, 89, 61, 77, ............., "No", "No", "No", "No", ....., 0))
.toDF("Outlet","Boys_Toys","Girls_Toys","Men_Shoes","Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", ....., "Count_of_Yes")

Answer 1

You can convert the selected columns into an Array of 1s (for "Y") and 0s (for anything else), then sum the array elements with the aggregate higher-order function (available since Spark 2.4) in a SQL expression via selectExpr, as shown below. The sample data here uses "Y"/"N" flags; with the question's data, compare against "Yes" instead.

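import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(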
  (1, 120, 80, 150, "Y", "N", "Y"),
  (2, 50, 90, 110, "N", "N", "Y"),
  (3, 70, 160, 90, "N", "Y", "N")
).toDF("id", "qty_a", "qty_b", "qty_c", "over100_a", "over100_b", "over100_c")

val cols = df.columns.filter(_.startsWith("over100_"))

df.
  withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*)).
  selectExpr("*", "aggregate(arr, 0, (acc, x) -> acc + x) as yes_count").
  show
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
// | id|qty_a|qty_b|qty_c|over100_a|over100_b|over100_c|      arr|yes_count|
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
// |  1|  120|   80|  150|        Y|        N|        Y|[1, 0, 1]|        2|
// |  2|   50|   90|  110|        N|        N|        Y|[0, 0, 1]|        1|
// |  3|   70|  160|   90|        N|        Y|        N|[0, 1, 0]|        1|
// +---+-----+-----+-----+---------+---------+---------+---------+---------+

Alternatively, use explode and groupBy/agg to sum the Array elements. Note that the result contains only id and yes_count:

df.
  withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*)).
  withColumn("flattened", explode($"arr")).
  groupBy("id").agg(sum($"flattened").as("yes_count"))
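
Both variants above build an intermediate array column. As a possible simplification that is not part of the original answer, the 0/1 expressions can be added together directly into a single Column, which avoids the extra "arr" column and keeps every original column in the result. The sketch below reuses the answer's sample data; the names flagCols, yesCount and result, and the local SparkSession setup, are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

// Local session for experimentation; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("yes-count").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 120, 80, 150, "Y", "N", "Y"),
  (2, 50, 90, 110, "N", "N", "Y"),
  (3, 70, 160, 90, "N", "Y", "N")
).toDF("id", "qty_a", "qty_b", "qty_c", "over100_a", "over100_b", "over100_c")

val flagCols = df.columns.filter(_.startsWith("over100_"))

// Build one 0/1 Column per flag column, then add them all into one expression.
// Starting the fold from lit(0) keeps this safe even if flagCols is empty.
val yesCount = flagCols
  .map(c => when(col(c) === "Y", 1).otherwise(0))
  .foldLeft(lit(0))(_ + _)

val result = df.withColumn("yes_count", yesCount)
result.show()
```

For the question's data, the same pattern should apply with a filter such as _.endsWith(">100") and a comparison against "Yes".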

Answer 2

"How do I iterate over each row to get the count of Yes?" You can use a map transformation to transform each record. In your case, the function passed to df.map() should count the number of "Yes" values in the row and emit a new record that includes this additional column.

Pseudocode:

df.map(row => count the number of "Yes" values in row and append that count at the end of the record)
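
One way the pseudocode might be made concrete is sketched below. Note the substitution: it maps over df.rdd rather than calling df.map directly, because Dataset.map on Row values needs an Encoder whose construction differs between Spark versions. The small sample data, the ">100" suffix filter, and the names flagCols, outSchema, outRows and result are all assumptions for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.IntegerType

// Local session for experimentation; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("yes-count-map").getOrCreate()
import spark.implicits._

// Hypothetical two-flag sample in the shape of the question's data.
val df = Seq(
  ("Mumbai", 90, 109, "No", "Yes"),
  ("Dubai", 201, 229, "Yes", "Yes"),
  ("Bangalore", 56, 89, "No", "No")
).toDF("Outlet", "Boys_Toys", "Girls_Toys", "BT>100", "GT>100")

// Assumption: the flag columns are exactly those whose names end in ">100".
val flagCols = df.columns.filter(_.endsWith(">100"))

// Extend the original schema with the new integer column.
val outSchema = df.schema.add("Count_of_Yes", IntegerType, nullable = false)

// For each row, count the flag fields equal to "Yes" and append that count.
val outRows = df.rdd.map { row =>
  val yesCount = flagCols.count(c => row.getAs[String](c) == "Yes")
  Row.fromSeq(row.toSeq :+ yesCount)
}

val result = spark.createDataFrame(outRows, outSchema)
result.show()
```

Row-by-row mapping like this bypasses Spark's built-in column expressions, so the column-based approach in Answer 1 is usually the better first choice; the map form is mainly useful when the per-row logic is too complex to express with built-in functions.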
