
How to create new columns conditional on columns in a df and combine them into one in R

I am quite new to R and have a df in which I am creating some criteria (a1, b1, c1, d1, and so on) using sqldf. In this example I only show a1 to c1.

df <- data.frame('var1' = c('x','1', 'X', '', 'X'), "var2" = c('y','3', '', 'X', ''), "var3" = c('y','7', '', 'X', 'X'))

library(sqldf)

testcases_sql <- 
("

CASE WHEN var1 = 1  THEN 1 ELSE 0 END AS a1, 

CASE WHEN var1 = 1  AND var2 = 'y' THEN 1 ELSE 0 END AS b1,

CASE WHEN var1= 1 AND var2= 3 THEN 1 ELSE 0 END AS b1,

CASE WHEN var1= 1 AND var2= 3 THEN 1 ELSE 0 END AS b1,

CASE WHEN var1= 1 AND var2= 'X' THEN 1 ELSE 0 END AS b1,

CASE WHEN var1= 1 AND var2= 'X' AND var3=7 THEN 1 ELSE 0 END AS c1,

CASE WHEN var1= 'X' AND var3='X' THEN 1 ELSE 0 END AS c1")



sql_string = paste("SELECT *" , ",", testcases_sql, " FROM ", "df", sep=" ") 

#create criteria
data = sqldf(sql_string)
head(data)

sqldf creates a new column for each criterion:

head(data)

# var1 var2 var3 a1 b1 b1 b1 b1 c1 c1
# 1    x    y    y  0  0  0  0  0  0  0
# 2    1    3    7  1  0  1  1  0  0  0
# 3    X            0  0  0  0  0  0  0
# 4         X    X  0  0  0  0  0  0  0
# 5    X         X  0  0  0  0  0  0  1

But I need each criterion in a single variable, so that all the b1's are in one column, all the c1's are in one column, and so on. It does not matter how many times a row meets the criterion; I only need a '1' in each column. In my original df there is no pattern to how many times a criterion is repeated; it is totally random.

My expected results are:

wished_df <- data.frame('var1' = c('x','1', 'X', '', 'X'), "var2" = c('y','3', '', 'X', ''), "var3" = c('y','7', '', 'X', 'X'), "a1" = c('0','1', '0', '0', '0'), "b1" = c('0','1', '0', '0','0'), "c1" = c('0','0', '0', '0','1'))

head(wished_df)
#  var1 var2 var3 a1 b1 c1
#1    x    y    y  0   0   0
#2    1    3    7  1   1   0
#3    X            0   0   0
#4         X    X  0   0   0
#5    X         X  0   0   1

It might be that sqldf is not the best tool for this. My best approach so far would be to change the df afterwards by summing the variables together:

#sum the variable

data$newb1 <- data$b1 + data$b1 + data$b1 + data$b1

#error in `$<-.data.frame`(`*tmp*`, newb1, value = numeric(0)) : replacement has 0 rows, data has 5

#delete the old variable
data$b1 <- data$b1 <-data$b1 <- data$b1 <- NULL

#rename the variable
data$b1 <- data$newb1

#delete old variable
data$newb1 <- NULL

#repeat for c1, d1, e1 and so on...

data$newc1 <- data$c1 + data$c1

data$c1 <- data$c1 <- NULL

data$c1 <- data$newc1

data$newc1 <- NULL

This does not work, is quite ugly, and would require a lot of code (I have 80 test cases).

Is there an easier way to do this?

Thanks a lot in advance.

I would just use R's built-in Boolean operators for this task. Note that I have removed some logical redundancy from your SQL selections:

df <- data.frame('var1' = c('x','1', 'X', '', 'X'), 
                 "var2" = c('y','3', '', 'X', ''), 
                 "var3" = c('y','7', '', 'X', 'X'))

df$a1 <- 1 *  (df$var1 == "1")
df$b1 <- 1 * ((df$var1 == "1") & (df$var2 == "y" | df$var2 == "3"  | df$var2 == "X"))
df$c1 <- 1 * ((df$var1 == "1"  &  df$var2 == "X" & df$var3 == "7") | 
              (df$var1 == "X"  &  df$var3 == "X"))

df
#>   var1 var2 var3 a1 b1 c1
#> 1    x    y    y  0  0  0
#> 2    1    3    7  1  1  0
#> 3    X            0  0  0
#> 4         X    X  0  0  0
#> 5    X         X  0  0  1

Created on 2020-05-14 by the reprex package (v0.3.0)
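If the sqldf result with repeated column names already exists, the duplicates can also be collapsed by name after the fact. The following is only a sketch under assumptions: `data` is a toy stand-in for the query result (not the real sqldf output), and `collapse_dups` is a hypothetical helper name, not something from the question or from any package. It takes the element-wise maximum over all columns that share a name, so a row gets 1 if any of the duplicates is 1.

```r
# Toy stand-in for the sqldf result: indicator columns with repeated names.
data <- data.frame(
  var1 = c("x", "1", "X", "", "X"),
  a1   = c(0, 1, 0, 0, 0),
  b1   = c(0, 0, 0, 0, 0),
  b1   = c(0, 1, 0, 0, 0),
  c1   = c(0, 0, 0, 0, 0),
  c1   = c(0, 0, 0, 0, 1),
  check.names = FALSE,       # keep the duplicated names as they are
  stringsAsFactors = FALSE
)

# Hypothetical helper: merge columns that share a name via element-wise max.
collapse_dups <- function(d) {
  cols <- unname(as.list(d))
  # Group column positions by name, preserving first-appearance order.
  groups <- split(seq_along(cols), factor(names(d), levels = unique(names(d))))
  out <- lapply(groups, function(idx) Reduce(pmax, cols[idx]))
  as.data.frame(out, stringsAsFactors = FALSE)
}

collapsed <- collapse_dups(data)
collapsed
```

Columns that occur only once (such as `var1` here) pass through unchanged, so the same call should cover all 80 test cases without naming them individually.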

In SQL we can OR the conditions to simplify the code. Each true condition evaluates to 1 and each false condition to 0. We have changed the name of the SQL string to testcasesSQL because $ string interpolation requires word characters in the variable name: a non-word character terminates the variable name and is not regarded as part of it.

If there were some pattern to the test cases, we could generate the testcasesSQL string using R code, but it is unclear whether there is one, so here we just fix the code in the question and translate it to more compact SQL.

Note that the logical condition (var1 = 1) or (var1 = 1 AND var2 = 'y') can be simplified to just (var1 = 1). Below we have NOT applied this or other potential logical simplifications, to make it clear how the code in the question translates directly to simpler SQL. Also, if the conditions are generated automatically they may not be in the simplest form anyway, and from the viewpoint of the answer it makes no difference.

library(sqldf)

testcasesSQL <- "(var1 = 1) or (var1 = 1  AND var2 = 'y') as a1,
  (var1 = 1 AND var2 = 'y') or (var1 = 1 AND var2 = 3) or (var1 = 1 AND var2 = 'X') AS b1,
  (var1 = 1 AND var2 = 'X' AND var3 = 7) or (var1 = 'X' AND var3 ='X') AS c1"

dfname <- "df"

fn$sqldf("select *, $testcasesSQL from $dfname")

giving:

  var1 var2 var3 a1 b1 c1
1    x    y    y  0  0  0
2    1    3    7  1  1  0
3    X            0  0  0
4         X    X  0  0  0
5    X         X  0  0  1
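The simplification mentioned earlier, that (A) OR (A AND B) reduces to A, can be checked on the example data in base R. This is only an illustrative check; the names `A`, `B`, `a1_full` and `a1_simple` are made up for the sketch and do not appear in the question.

```r
df <- data.frame(var1 = c("x", "1", "X", "", "X"),
                 var2 = c("y", "3", "", "X", ""),
                 var3 = c("y", "7", "", "X", "X"),
                 stringsAsFactors = FALSE)

A <- df$var1 == "1"   # var1 = 1
B <- df$var2 == "y"   # var2 = 'y'

a1_full   <- A | (A & B)  # condition as written in the SQL for a1
a1_simple <- A            # simplified form

identical(a1_full, a1_simple)
#> [1] TRUE
```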

Generating the condition

We can define a matrix that has the condition name as column 1, followed by a column each for var1, var2 and var3, such that the conditions on one row are AND'd and the conditions on multiple rows with the same condition name are OR'd. From the example in the question it seems that var1 is always present, and we have used that fact in the gsub line.

condmat <- matrix(c('c1', 1, NA, NA,
                    'c1', 1, 'y', NA,
                    'c2', 1, 'y', NA,
                    'c2', 1, 3, NA,
                    'c2', 1, 'X', NA,
                    'c3', 1, 'X', 7,
                    'c3', 'X', NA, 'X'), ncol = 4, byrow = TRUE)
colnames(condmat) <- c("cond", "var1", "var2", "var3")

s <- sprintf("(%s = '%s' AND %s = '%s' AND %s = '%s')", 
  colnames(condmat)[2], condmat[, 2], 
  colnames(condmat)[3], condmat[, 3], 
  colnames(condmat)[4], condmat[, 4])

s2 <- gsub("AND \\w+ = 'NA'", "", s)
s3 <- tapply(s2, condmat[, 1], paste, collapse = " OR ")
cond <- paste(paste(s3, 'as', names(s3)), collapse = ",\n")

dfname <- "df"

fn$sqldf("select *, $cond from $dfname")

Note that the cond variable generated by the above is:

cat(cond)
## (var1 = '1'  ) OR (var1 = '1' AND var2 = 'y' ) as c1,
## (var1 = '1' AND var2 = 'y' ) OR (var1 = '1' AND var2 = '3' ) OR (var1 = '1' AND var2 = 'X' ) as c2,
## (var1 = '1' AND var2 = 'X' AND var3 = '7') OR (var1 = 'X'  AND var3 = 'X') as c3
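The same condition matrix could also be evaluated directly in base R, without generating SQL at all. This is a swapped-in alternative rather than the approach above, and only a sketch: `row_hits` and `res` are names invented here, and the code assumes every non-NA cell means an exact string match on that column. Cells on one row are AND'd, and rows sharing a condition name are OR'd, as described above.

```r
df <- data.frame(var1 = c("x", "1", "X", "", "X"),
                 var2 = c("y", "3", "", "X", ""),
                 var3 = c("y", "7", "", "X", "X"),
                 stringsAsFactors = FALSE)

condmat <- matrix(c('c1', 1, NA, NA,
                    'c1', 1, 'y', NA,
                    'c2', 1, 'y', NA,
                    'c2', 1, 3, NA,
                    'c2', 1, 'X', NA,
                    'c3', 1, 'X', 7,
                    'c3', 'X', NA, 'X'), ncol = 4, byrow = TRUE)
colnames(condmat) <- c("cond", "var1", "var2", "var3")

# One column per condmat row: TRUE where every non-NA cell matches (AND).
row_hits <- apply(condmat, 1, function(r) {
  ok <- rep(TRUE, nrow(df))
  for (v in names(r)[-1]) {
    if (!is.na(r[[v]])) ok <- ok & (df[[v]] == r[[v]])
  }
  ok
})

# OR together the condmat rows that share a condition name.
res <- sapply(split(seq_len(nrow(condmat)), condmat[, "cond"]), function(idx) {
  as.integer(rowSums(row_hits[, idx, drop = FALSE]) > 0)
})

cbind(df, res)
```

Adding a test case then means adding a row to condmat; no SQL string handling is involved.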
