
Count of values in a row in spark dataframe using scala

I have a dataframe. It contains the amount of sales for different items across different sales outlets. The dataframe shown below only shows a few of the items across a few sales outlets. There's a benchmark of 100 units sold per day for each item. Each item that sold more than 100 is marked as "Yes", and those below 100 are marked as "No".

val df1 = Seq(
("Mumbai", 90, 109, 101, 78, ............., "No", "Yes", "Yes", "No", .....),
("Singapore", 149, 129, 201, 107, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Hawaii", 127, 101, 98, 109, ............., "Yes", "Yes", "No", "Yes", .....),
("New York", 146, 130, 173, 117, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Los Angeles", 94, 99, 95, 113, ............., "No", "No", "No", "Yes", .....),
("Dubai", 201, 229, 265, 317, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Bangalore", 56, 89, 61, 77, ............., "No", "No", "No", "No", .....))
.toDF("Outlet","Boys_Toys","Girls_Toys","Men_Shoes","Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", .....)

Now, I want to add a column "Count_of_Yes" in which, for each sales outlet (each row), the value will be the total number of "Yes" values in that row. How do I iterate over each row to get the count of "Yes"?

My expected dataframe should be:

val output_df = Seq(
("Mumbai", 90, 109, 101, 78, ............., "No", "Yes", "Yes", "No", ....., 2),
("Singapore", 149, 129, 201, 107, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Hawaii", 127, 101, 98, 109, ............., "Yes", "Yes", "No", "Yes", ....., 3),
("New York", 146, 130, 173, 117, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Los Angeles", 94, 99, 95, 113, ............., "No", "No", "No", "Yes", ....., 1),
("Dubai", 201, 229, 265, 317, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Bangalore", 56, 89, 61, 77, ............., "No", "No", "No", "No", ....., 0))
.toDF("Outlet","Boys_Toys","Girls_Toys","Men_Shoes","Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", ....., "Count_of_Yes")

You can convert the selected list of columns into an array of 1s (for "Yes") and 0s (for "No") and sum the array elements with aggregate in a SQL expression via selectExpr, as shown below:

// Assumes a SparkSession is in scope (e.g. spark-shell), which provides
// the implicits needed by toDF and $"..."
import org.apache.spark.sql.functions._

val df = Seq(
  (1, 120, 80, 150, "Y", "N", "Y"),
  (2, 50, 90, 110, "N", "N", "Y"),
  (3, 70, 160, 90, "N", "Y", "N")
).toDF("id", "qty_a", "qty_b", "qty_c", "over100_a", "over100_b", "over100_c")

val cols = df.columns.filter(_.startsWith("over100_"))

df.
  withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*)).
  selectExpr("*", "aggregate(arr, 0, (acc, x) -> acc + x) as yes_count").
  show
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
// | id|qty_a|qty_b|qty_c|over100_a|over100_b|over100_c|      arr|yes_count|
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
// |  1|  120|   80|  150|        Y|        N|        Y|[1, 0, 1]|        2|
// |  2|   50|   90|  110|        N|        N|        Y|[0, 0, 1]|        1|
// |  3|   70|  160|   90|        N|        Y|        N|[0, 1, 0]|        1|
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
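A more compact variant of the same idea (a sketch using the same df and cols as above; yes_count is just an illustrative name) sums the 0/1 indicator columns directly with reduce, avoiding the intermediate array column:

// Build one 0/1 indicator Column per flag column, then add them up.
val yesCount = cols.map(c => when(col(c) === "Y", 1).otherwise(0)).reduce(_ + _)

df.withColumn("yes_count", yesCount).show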

Alternatively, use explode and groupBy/agg to sum the Array elements:

df.
  withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*)).
  withColumn("flattened", explode($"arr")).
  groupBy("id").agg(sum($"flattened").as("yes_count"))

"How do I iterate over each row to get the count of Yes?" You can use a map transformation to transform each record. So in your case, df.map() should contain the code to count the number of "Yes" values and emit a new record with this additional column.

Pseudocode as follows:

df.map(row => /* count the "Yes" values in row and append the count as a new field */)
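A minimal runnable sketch of that idea over the small df defined above (assuming a SparkSession named spark is in scope; yes_count and yesIdx are illustrative names), using the RDD API so the extra field can be appended to each Row:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.IntegerType

// Positions of the flag columns, resolved once on the driver.
val yesIdx = cols.map(df.columns.indexOf)

// Count "Y" values per row and append the count as an extra field.
val withCount = df.rdd.map { row =>
  val n = yesIdx.count(i => row.getString(i) == "Y")
  Row.fromSeq(row.toSeq :+ n)
}

// Extend the original schema with the new column and rebuild the dataframe.
val result = spark.createDataFrame(withCount, df.schema.add("yes_count", IntegerType))
result.show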
