Count of values in a row in spark dataframe using scala

I have a DataFrame that contains the sales amounts for different items across different sales outlets. The DataFrame below shows only a few of the items and a few of the outlets. There is a benchmark of 100 sales per day for each item: an item that sold more than 100 is marked "Yes", and one that sold below 100 is marked "No".

val df1 = Seq(
("Mumbai", 90, 109, 101, 78, ............., "No", "Yes", "Yes", "No", .....),
("Singapore", 149, 129, 201, 107, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Hawaii", 127, 101, 98, 109, ............., "Yes", "Yes", "No", "Yes", .....),
("New York", 146, 130, 173, 117, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Los Angeles", 94, 99, 95, 113, ............., "No", "No", "No", "Yes", .....),
("Dubai", 201, 229, 265, 317, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Bangalore", 56, 89, 61, 77, ............., "No", "No", "No", "No", .....))
.toDF("Outlet","Boys_Toys","Girls_Toys","Men_Shoes","Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", .....)

Now, I want to add a column "Count_of_Yes" whose value, for each sales outlet (each row), is the total number of "Yes" values in that row. How do I iterate over each row to get the count of "Yes"?

My expected DataFrame is:

val output_df = Seq(
("Mumbai", 90, 109, 101, 78, ............., "No", "Yes", "Yes", "No", ....., 2),
("Singapore", 149, 129, 201, 107, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Hawaii", 127, 101, 98, 109, ............., "Yes", "Yes", "No", "Yes", ....., 3),
("New York", 146, 130, 173, 117, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Los Angeles", 94, 99, 95, 113, ............., "No", "No", "No", "Yes", ....., 1),
("Dubai", 201, 229, 265, 317, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Bangalore", 56, 89, 61, 77, ............., "No", "No", "No", "No", ....., 0))
.toDF("Outlet","Boys_Toys","Girls_Toys","Men_Shoes","Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", ....., "Count_of_Yes")

You can convert the selected columns into an array of 1s (for "Y") and 0s (for "N"), then sum the array elements with the aggregate higher-order function (available in Spark SQL since 2.4) in a SQL expression passed to selectExpr, as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

val df = Seq(
  (1, 120, 80, 150, "Y", "N", "Y"),
  (2, 50, 90, 110, "N", "N", "Y"),
  (3, 70, 160, 90, "N", "Y", "N")
).toDF("id", "qty_a", "qty_b", "qty_c", "over100_a", "over100_b", "over100_c")

// Names of the flag columns to count
val cols = df.columns.filter(_.startsWith("over100_"))

df.
  withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*)).
  selectExpr("*", "aggregate(arr, 0, (acc, x) -> acc + x) as yes_count").
  show
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
// | id|qty_a|qty_b|qty_c|over100_a|over100_b|over100_c|      arr|yes_count|
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
// |  1|  120|   80|  150|        Y|        N|        Y|[1, 0, 1]|        2|
// |  2|   50|   90|  110|        N|        N|        Y|[0, 0, 1]|        1|
// |  3|   70|  160|   90|        N|        Y|        N|[0, 1, 0]|        1|
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
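
If the intermediate array column is not needed, the same count can be obtained by adding up one when(...) expression per flag column. The following is a minimal sketch of that variation, not part of the original answer. It assumes a local Spark session; the names spark, yesCount and counted, and the app name, are illustrative choices.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

// Assumption: a local session for demonstration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("yes-count-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 120, 80, 150, "Y", "N", "Y"),
  (2, 50, 90, 110, "N", "N", "Y"),
  (3, 70, 160, 90, "N", "Y", "N")
).toDF("id", "qty_a", "qty_b", "qty_c", "over100_a", "over100_b", "over100_c")

val cols = df.columns.filter(_.startsWith("over100_"))

// Build one 0/1 expression per flag column and fold them into a single sum expression.
val yesCount = cols
  .map(c => when(col(c) === "Y", 1).otherwise(0))
  .foldLeft(lit(0))(_ + _)

// No helper array column is created; only yes_count is added.
val counted = df.withColumn("yes_count", yesCount)
counted.show()
```

For the DataFrame in the question, the column filter would presumably become _.endsWith(">100") and the compared value "Yes".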

Alternatively, use explode and groupBy/agg to sum the Array elements:

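df.
  withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*)).
  withColumn("flattened", explode($"arr")).
  // Assumes id uniquely identifies each row; only id and yes_count remain after grouping
  groupBy("id").agg(sum($"flattened").as("yes_count")).
  show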

Regarding "How do I iterate over each row to get the count of Yes?": you can use a map transformation to transform each record. In your case, the function passed to df.map() would count the number of "Yes" values in the row and emit a new record that carries this additional column.

Pseudocode:

df.map(row => <count the number of "Yes" values in row and append the count to the end of the record>)
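
The map-based suggestion above can be sketched roughly as follows. This is one possible reading of the pseudocode rather than the answerer's own code. It reuses the small example DataFrame from the first answer, assumes the id column uniquely identifies each row, and uses illustrative names (spark, counts, withCount). Instead of rebuilding the whole record, it emits an (id, count) pair per row and joins it back.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for demonstration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("row-map-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 120, 80, 150, "Y", "N", "Y"),
  (2, 50, 90, 110, "N", "N", "Y"),
  (3, 70, 160, 90, "N", "Y", "N")
).toDF("id", "qty_a", "qty_b", "qty_c", "over100_a", "over100_b", "over100_c")

val cols = df.columns.filter(_.startsWith("over100_"))

// Map each Row to (id, number of "Y" flags); the tuple encoder comes from spark.implicits._
val counts = df
  .map(row => (row.getAs[Int]("id"), cols.count(c => row.getAs[String](c) == "Y")))
  .toDF("id", "yes_count")

// Join back on id (assumed unique) to keep the original columns next to the count.
val withCount = df.join(counts, Seq("id"))
withCount.show()
```

Because map works on deserialized Row objects, the column-expression approaches shown earlier are usually the simpler choice when built-in functions can express the logic.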

