Sparklyr/Dplyr - How to apply a user-defined function to each row of a Spark data frame and write the output of each row to a new column?

I have a spark_tbl containing 160+ columns.

I will give an example to show how the dataframe looks:

Key  A  B  C  D  E  F  G .....Z

s1   0  1  0  1  1  0  1      0
s2   1  0  0  0  0  0  0      0
s3   1  1  0  0  0  0  0      0
s4   0  1  0  1  1  0  0      0

What I want to achieve is to create a new column based on the values in each column, like this:

Key  A  B  C  D  E  F  G .....Z  panel

s1   0  1  0  1  1  0  1      0  B,D,E,G
s2   1  0  0  0  0  0  0      0  A 
s3   1  1  0  0  0  0  0      0  A,B
s4   0  1  0  1  1  0  0      0  B,D,E

Check each column row-wise, add the column name to a string if the value is 1, and finally write that string to a column called panel.

My attempt at writing a user defined function:

get_panel <- function(eachrow) {
  # `columns` is the vector of column names of the full data frame ("Key" first),
  # so columns[i + 1] maps position i of the row (which excludes "Key") to its name.
  id <- ""
  row_list <- as.list(eachrow)
  for (i in seq_along(row_list)) {
    if (row_list[[i]] == "1") {
      if (id == "") {
        id <- columns[i + 1]
      } else {
        id <- paste0(id, ",", columns[i + 1])
      }
    }
  }
  return(id)
}

This works with a regular data frame using the apply function; a minimal sketch of that usage is shown below.
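
Roughly how get_panel() can be called locally; the `columns` vector and the call itself are assumptions about the setup, since the question does not show the actual code:

columns <- colnames(df)                     # "Key", "A", "B", ..., "Z"
df$panel <- apply(df[, -1], 1, get_panel)   # row-wise over everything except "Key"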

But how do I apply this function to a Spark DataFrame or tbl_spark?

I think that @JasonAizkalns is on the right track. Starting with his example:

library(dplyr)
library(sparklyr)
sc <- spark_connect(master = "local")


mat <- matrix(c(paste0("s", 1:4), as.numeric(sample(0:1, 4 * 26, TRUE))), ncol = 27)
colnames(mat) <- c("Key", LETTERS[1:26])

df <- data.frame(mat, stringsAsFactors = FALSE) %>%
  mutate_at(vars(-"Key"), as.numeric) %>%
  as_data_frame()
df

dfs <- copy_to(sc, df, overwrite = TRUE)

We can get there using a little rlang magic: sym(letter) turns each column name into a symbol and !! splices it into the mutate() call, while Spark's concat_ws() builds up the string.

dfs <- dfs %>% mutate(panel = "")
for (letter in LETTERS[1:26]) {
  dfs <- dfs %>% mutate(panel = concat_ws(",", panel, ifelse(!!sym(letter) == 1.0, yes = letter, no = NA)))
}

dfs %>% 
  mutate(panel = regexp_replace(panel, "^,", "")) %>% # remove leading comma
  select(Key, A:D, panel)

This gives what I think you want:

# Source: spark<?> [?? x 6]
  Key       A     B     C     D panel                           
* <chr> <dbl> <dbl> <dbl> <dbl> <chr>                           
1 s1        0     0     1     1 C,D,E,G,O,P,Q,U,Z              
2 s2        1     0     0     1 A,D,G,K,L,M,N,Q,S,U,W          
3 s3        0     1     0     0 B,E,L,M,O,Q,R,S,T,Y            
4 s4        1     1     0     1 A,B,D,E,G,I,J,M,N,R,S,T,U,V,Y,Z

The key here is the concat_ws Spark SQL (not R) function. See https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#concat_ws-java.lang.String-org.apache.spark.sql.Column...-
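
A quick sketch (not part of the original answer) of why the ifelse(..., no = NA) trick works: Spark's concat_ws() silently skips NULL arguments, so non-matching columns contribute nothing to the string. You can see this through the DBI interface of the same sc connection:

DBI::dbGetQuery(sc, "SELECT concat_ws(',', 'A', NULL, 'C') AS panel")
#>   panel
#> 1   A,C

Only the leading comma coming from the initial empty panel column survives, which is what the regexp_replace step above strips off.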

Check out this Scala solution.

scala> val df = Seq(("s1",0,1,0,1,1,0,1),
     | ("s2",1,0,0,0,0,0,0),
     | ("s3",1,1,0,0,0,0,0),
     | ("s4",0,1,0,1,1,0,0)).toDF("key","A","B","C","D","E","F","G")
df: org.apache.spark.sql.DataFrame = [key: string, A: int ... 6 more fields]

scala> df.show
+---+---+---+---+---+---+---+---+
|key|  A|  B|  C|  D|  E|  F|  G|
+---+---+---+---+---+---+---+---+
| s1|  0|  1|  0|  1|  1|  0|  1|
| s2|  1|  0|  0|  0|  0|  0|  0|
| s3|  1|  1|  0|  0|  0|  0|  0|
| s4|  0|  1|  0|  1|  1|  0|  0|
+---+---+---+---+---+---+---+---+

scala> val columns = df.columns.filter(x=>x != "key")
columns: Array[String] = Array(A, B, C, D, E, F, G)

scala> val p1 = columns.map( x => when(col(x)===lit(1),x+",").otherwise(lit(""))).reduce(concat(_,_)).as("panel")
p1: org.apache.spark.sql.Column = concat(concat(concat(concat(concat(concat(CASE WHEN (A = 1) THEN A, ELSE  END, CASE WHEN (B = 1) THEN B, ELSE  END), CASE WHEN (C = 1) THEN C, ELSE  END), CASE WHEN (D = 1) THEN D, ELSE  END), CASE WHEN (E = 1) THEN E, ELSE  END), CASE WHEN (F = 1) THEN F, ELSE  END), CASE WHEN (G = 1) THEN G, ELSE  END) AS `panel`

scala> df.select(p1).show(false)
+--------+
|panel   |
+--------+
|B,D,E,G,|
|A,      |
|A,B,    |
|B,D,E,  |
+--------+

With all columns:

scala> df.select(col("*"), p1).show
+---+---+---+---+---+---+---+---+--------+
|key|  A|  B|  C|  D|  E|  F|  G|   panel|
+---+---+---+---+---+---+---+---+--------+
| s1|  0|  1|  0|  1|  1|  0|  1|B,D,E,G,|
| s2|  1|  0|  0|  0|  0|  0|  0|      A,|
| s3|  1|  1|  0|  0|  0|  0|  0|    A,B,|
| s4|  0|  1|  0|  1|  1|  0|  0|  B,D,E,|
+---+---+---+---+---+---+---+---+--------+

There is a trailing comma in the result. That can be removed with regexp_replace:

scala> df.select(col("*"), regexp_replace(p1,",$","").as("panel")).show
+---+---+---+---+---+---+---+---+-------+
|key|  A|  B|  C|  D|  E|  F|  G|  panel|
+---+---+---+---+---+---+---+---+-------+
| s1|  0|  1|  0|  1|  1|  0|  1|B,D,E,G|
| s2|  1|  0|  0|  0|  0|  0|  0|      A|
| s3|  1|  1|  0|  0|  0|  0|  0|    A,B|
| s4|  0|  1|  0|  1|  1|  0|  0|  B,D,E|
+---+---+---+---+---+---+---+---+-------+


scala> 

EDIT 2:

A cleaner approach is to use the array() function together with concat_ws, which skips NULL elements of the array, so no trailing comma is produced:

scala> val df = Seq(("s1",0,1,0,1,1,0,1),("s2",1,0,0,0,0,0,0),("s3",1,1,0,0,0,0,0),("s4",0,1,0,1,1,0,0)).toDF("key","A","B","C","D","E","F","G")
df: org.apache.spark.sql.DataFrame = [key: string, A: int ... 6 more fields]

scala> df.show(false)
+---+---+---+---+---+---+---+---+
|key|A  |B  |C  |D  |E  |F  |G  |
+---+---+---+---+---+---+---+---+
|s1 |0  |1  |0  |1  |1  |0  |1  |
|s2 |1  |0  |0  |0  |0  |0  |0  |
|s3 |1  |1  |0  |0  |0  |0  |0  |
|s4 |0  |1  |0  |1  |1  |0  |0  |
+---+---+---+---+---+---+---+---+


scala> val p1 = columns.map( x => when(col(x)===lit(1),x).otherwise(null))
p1: Array[org.apache.spark.sql.Column] = Array(CASE WHEN (A = 1) THEN A ELSE NULL END, CASE WHEN (B = 1) THEN B ELSE NULL END, CASE WHEN (C = 1) THEN C ELSE NULL END, CASE WHEN (D = 1) THEN D ELSE NULL END, CASE WHEN (E = 1) THEN E ELSE NULL END, CASE WHEN (F = 1) THEN F ELSE NULL END, CASE WHEN (G = 1) THEN G ELSE NULL END)

scala> df.select(col("*"),array(p1:_*).alias("panel")).withColumn("panel2",concat_ws(",",'panel)).show(false)
+---+---+---+---+---+---+---+---+----------------+-------+
|key|A  |B  |C  |D  |E  |F  |G  |panel           |panel2 |
+---+---+---+---+---+---+---+---+----------------+-------+
|s1 |0  |1  |0  |1  |1  |0  |1  |[, B,, D, E,, G]|B,D,E,G|
|s2 |1  |0  |0  |0  |0  |0  |0  |[A,,,,,,]       |A      |
|s3 |1  |1  |0  |0  |0  |0  |0  |[A, B,,,,,]     |A,B    |
|s4 |0  |1  |0  |1  |1  |0  |0  |[, B,, D, E,,]  |B,D,E  |
+---+---+---+---+---+---+---+---+----------------+-------+


scala>

Not sure if this will translate 100% to sparklyr, but you may be able to use sdf_nest:

library(tidyverse)

mat <- matrix(c(paste0("s", 1:4), as.numeric(sample(0:1, 4 * 26, TRUE))), ncol = 27)
colnames(mat) <- c("Key", LETTERS[1:26])

df <- data.frame(mat, stringsAsFactors = FALSE) %>%
  mutate_at(vars(-"Key"), as.numeric) %>%
  as_data_frame()
df
#> # A tibble: 4 x 27
#>   Key       A     B     C     D     E     F     G     H     I     J     K
#>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 s1        0     1     1     1     1     0     0     0     0     1     1
#> 2 s2        0     1     0     1     0     1     1     1     1     0     0
#> 3 s3        0     1     1     1     1     0     0     0     0     1     1
#> 4 s4        0     0     0     1     0     0     0     1     1     0     1
#> # ... with 15 more variables: L <dbl>, M <dbl>, N <dbl>, O <dbl>, P <dbl>,
#> #   Q <dbl>, R <dbl>, S <dbl>, T <dbl>, U <dbl>, V <dbl>, W <dbl>,
#> #   X <dbl>, Y <dbl>, Z <dbl>

df %>%
  group_by(Key) %>%
  nest() %>%
  mutate(panel = map_chr(data, ~ unlist(.) %>% as.logical %>% names(df)[-1][.] %>% paste(collapse = ",")))
#> # A tibble: 4 x 3
#>   Key   data              panel                           
#>   <chr> <list>            <chr>                          
#> 1 s1    <tibble [1 x 26]> B,C,D,E,J,K,L,M,N,O,P,Q,R,W,Y,Z
#> 2 s2    <tibble [1 x 26]> B,D,F,G,H,I,N,R,S,T,V,W,X,Z    
#> 3 s3    <tibble [1 x 26]> B,C,D,E,J,K,M,N,O,Q,R,S,T,V,X,Y
#> 4 s4    <tibble [1 x 26]> D,H,I,K,L,O,P,T,U,V,W,Z
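
To unpack the map_chr() step, for a single row it does roughly the following (an illustrative sketch, using the first row of the local df shown above):

row   <- df[1, -1]                            # indicator columns of one row
flags <- as.logical(unlist(row))              # 0/1 -> FALSE/TRUE
paste(names(df)[-1][flags], collapse = ",")   # keep the column names where TRUE
#> [1] "B,C,D,E,J,K,L,M,N,O,P,Q,R,W,Y,Z"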
