
How to create new columns conditional on other columns in a df and combine them into one in R

I am quite new to R and have a df in which I am creating some criteria (a1, b1, c1, d1, and so on) using sqldf. (In this example I only show a1 to c1.)

df <- data.frame('var1' = c('x','1', 'X', '', 'X'), "var2" = c('y','3', '', 'X', ''), "var3" = c('y','7', '', 'X', 'X'))

library(sqldf)

testcases_sql <- 
("

CASE WHEN var1 = 1  THEN 1 ELSE 0 END AS a1, 

CASE WHEN var1 = 1  AND var2 = 'y' THEN 1 ELSE 0 END AS b1,

CASE WHEN var1= 1 AND var2= 3 THEN 1 ELSE 0 END AS b1,

CASE WHEN var1= 1 AND var2= 3 THEN 1 ELSE 0 END AS b1,

CASE WHEN var1= 1 AND var2= 'X' THEN 1 ELSE 0 END AS b1,

CASE WHEN var1= 1 AND var2= 'X' AND var3=7 THEN 1 ELSE 0 END AS c1,

CASE WHEN var1= 'X' AND var3='X' THEN 1 ELSE 0 END AS c1")



sql_string = paste("SELECT *" , ",", testcases_sql, " FROM ", "df", sep=" ") 

#create criteria
data = sqldf(sql_string)
head(data)

sqldf creates a new column for each criterion:

head(data)

# var1 var2 var3 a1 b1 b1 b1 b1 c1 c1
# 1    x    y    y  0  0  0  0  0  0  0
# 2    1    3    7  1  0  1  1  0  0  0
# 3    X            0  0  0  0  0  0  0
# 4         X    X  0  0  0  0  0  0  0
# 5    X         X  0  0  0  0  0  0  1

But I need each criterion in a single variable, so that all the b1's are in one column, all the c1's are in another, and so on. It does not matter how many times a row meets a criterion; I only need a '1' in the corresponding column. In my original df, there is no pattern to how many times a criterion is repeated; it is completely random.

My expected results are:

wished_df <- data.frame('var1' = c('x','1', 'X', '', 'X'), "var2" = c('y','3', '', 'X', ''), "var3" = c('y','7', '', 'X', 'X'), "a1" = c('0','1', '0', '0', '0'), "b1" = c('0','1', '0', '0','0'), "c1" = c('0','0', '0', '0','1'))

head(wished_df)
#  var1 var2 var3 a1 b1 c1
#1    x    y    y  0   0   0
#2    1    3    7  1   1   0
#3    X            0   0   0
#4         X    X  0   0   0
#5    X         X  0   0   1

It might be that sqldf is not the best tool for this. My best approach so far would be to change the df afterwards by summing the variables together:

#sum the variable

data$newb1 <- data$b1 + data$b1 + data$b1 + data$b1

#error in `$<-.data.frame`(`*tmp*`, newb1, value = numeric(0)) : replacement has 0 rows, data has 5

#delete the old variable
data$b1 <- data$b1 <-data$b1 <- data$b1 <- NULL

#rename the variable
data$b1 <- data$newb1

#delete old variable
data$newb1 <- NULL

#repeat for c1, d1, e1 and so on...

data$newc1 <- data$c1 + data$c1

data$c1 <- data$c1 <- NULL

data$c1 <- data$newc1

data$newc1 <- NULL

This does not work, is quite ugly, and would require a lot of code (I have 80 test cases).

Is there an easier way to do this?

Thanks a lot in advance.

I would just use R's built-in boolean operators for this task. Note I have removed some logical redundancy from your SQL selections:

df <- data.frame('var1' = c('x','1', 'X', '', 'X'), 
                 "var2" = c('y','3', '', 'X', ''), 
                 "var3" = c('y','7', '', 'X', 'X'))

df$a1 <- 1 *  (df$var1 == "1")
df$b1 <- 1 * ((df$var1 == "1") & (df$var2 == "y" | df$var2 == "3"  | df$var2 == "X"))
df$c1 <- 1 * ((df$var1 == "1"  &  df$var2 == "X" & df$var3 == "7") | 
              (df$var1 == "X"  &  df$var3 == "X"))

df
#>   var1 var2 var3 a1 b1 c1
#> 1    x    y    y  0  0  0
#> 2    1    3    7  1  1  0
#> 3    X            0  0  0
#> 4         X    X  0  0  0
#> 5    X         X  0  0  1

Created on 2020-05-14 by the reprex package (v0.3.0)
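
If a wide result with repeated column names already exists, as in the question, the same-named columns can also be collapsed after the fact instead of being summed by hand. The following is only a minimal base R sketch, assuming every flag column holds numeric 0/1 values. The `data` frame is a hand-built, shortened stand-in for the sqldf output (an assumption, not the real result), and `flag_cols`, `groups` and `collapsed` are illustrative names.

```r
# Hypothetical stand-in for the sqldf result: flag columns with repeated names.
# check.names = FALSE keeps the duplicated names instead of renaming them.
data <- data.frame(
  var1 = c("x", "1", "X", "", "X"),
  a1 = c(0, 1, 0, 0, 0),
  b1 = c(0, 0, 0, 0, 0),
  b1 = c(0, 1, 0, 0, 0),
  c1 = c(0, 0, 0, 0, 0),
  c1 = c(0, 0, 0, 0, 1),
  check.names = FALSE
)

# Positions of the 0/1 columns, grouped by their (possibly repeated) name.
flag_cols <- which(names(data) != "var1")
groups <- split(flag_cols, names(data)[flag_cols])

# A row gets 1 if any of the same-named columns is 1, otherwise 0.
collapsed <- lapply(groups, function(idx) {
  as.integer(rowSums(data[idx]) > 0)
})

result <- cbind(data["var1"], as.data.frame(collapsed))
result
```

Selecting columns by position rather than by name is what makes this work with duplicated names, and the same loop handles any number of repeats per criterion.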

In SQL we can OR the conditions to simplify the code. Each true condition will evaluate to 1 and each false condition to 0. We have changed the name of the SQL string to testcasesSQL because $ string interpolation requires word characters for the variable name -- non-word characters terminate the variable name and are not regarded as part of the variable name.

If there were some pattern to the test cases then we could generate the testcasesSQL string using R code, but it is unclear whether there is one, so we just fix the code in the question and translate it to more compact SQL.

Note that the logical condition (var1 = 1) or (var1 = 1 AND var2 = 'y') can be simplified to just (var1 = 1). Below we have NOT applied this or other potential logical simplifications, to make it clear how the code in the question translates directly to simpler SQL. Also, if the conditions are generated automatically they may not be in the simplest form anyway, and from the viewpoint of the answer it makes no difference.
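
The simplification mentioned above is the absorption law, A OR (A AND B) = A. As a quick sanity check, it can be verified in R by enumerating all four truth-value combinations; `grid` and `lhs` are illustrative names, and A and B stand in for `var1 = 1` and `var2 = 'y'`.

```r
# Enumerate every TRUE/FALSE combination of two conditions A and B.
grid <- expand.grid(A = c(TRUE, FALSE), B = c(TRUE, FALSE))

# Absorption: A OR (A AND B) gives the same result as A alone.
lhs <- grid$A | (grid$A & grid$B)
all(lhs == grid$A)
#> [1] TRUE
```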

library(sqldf)

testcasesSQL <- "(var1 = 1) or (var1 = 1  AND var2 = 'y') as a1,
  (var1 = 1 AND var2 = 'y') or (var1 = 1 AND var2 = 3) or (var1 = 1 AND var2 = 'X') AS b1,
  (var1 = 1 AND var2 = 'X' AND var3 = 7) or (var1 = 'X' AND var3 ='X') AS c1"

dfname <- "df"

fn$sqldf("select *, $testcasesSQL from $dfname")

giving:

  var1 var2 var3 a1 b1 c1
1    x    y    y  0  0  0
2    1    3    7  1  1  0
3    X            0  0  0
4         X    X  0  0  0
5    X         X  0  0  1

Generating the condition

We can define a matrix that has the condition name as column 1, followed by a column each for var1, var2 and var3, such that the conditions on one row are AND'd and the conditions on multiple rows having the same condition name are OR'd. From the example in the question it seems that var1 is always present, and we have used that fact in the gsub line.

condmat <- matrix(c('c1', 1, NA, NA,
                    'c1', 1, 'y', NA,
                    'c2', 1, 'y', NA,
                    'c2', 1, 3, NA,
                    'c2', 1, 'X', NA,
                    'c3', 1, 'X', 7,
                    'c3', 'X', NA, 'X'), ncol = 4, byrow = TRUE)
colnames(condmat) <- c("cond", "var1", "var2", "var3")

s <- sprintf("(%s = '%s' AND %s = '%s' AND %s = '%s')", 
  colnames(condmat)[2], condmat[, 2], 
  colnames(condmat)[3], condmat[, 3], 
  colnames(condmat)[4], condmat[, 4])

s2 <- gsub("AND \\w+ = 'NA'", "", s)
s3 <- tapply(s2, condmat[, 1], paste, collapse = " OR ")
cond <- paste(paste(s3, 'as', names(s3)), collapse = ",\n")

dfname <- "df"

fn$sqldf("select *, $cond from $dfname")

Note that the cond variable that is generated by the above is:

cat(cond)
## (var1 = '1'  ) OR (var1 = '1' AND var2 = 'y' ) as c1,
## (var1 = '1' AND var2 = 'y' ) OR (var1 = '1' AND var2 = '3' ) OR (var1 = '1' AND var2 = 'X' ) as c2,
## (var1 = '1' AND var2 = 'X' AND var3 = '7') OR (var1 = 'X'  AND var3 = 'X') as c3
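
The same condition matrix could also be evaluated directly in base R, without building an SQL string. The following is a rough sketch of that alternative, not part of the sqldf approach above. It assumes all columns are compared as character strings, and `eval_row`, `row_hits` and `flags` are hypothetical names introduced for illustration.

```r
df <- data.frame(var1 = c("x", "1", "X", "", "X"),
                 var2 = c("y", "3", "", "X", ""),
                 var3 = c("y", "7", "", "X", "X"))

# Same layout as condmat above; NA means "no constraint on this column".
condmat <- matrix(c("c1", "1", NA,  NA,
                    "c1", "1", "y", NA,
                    "c2", "1", "y", NA,
                    "c2", "1", "3", NA,
                    "c2", "1", "X", NA,
                    "c3", "1", "X", "7",
                    "c3", "X", NA,  "X"),
                  ncol = 4, byrow = TRUE,
                  dimnames = list(NULL, c("cond", "var1", "var2", "var3")))

# Hypothetical helper: AND together the non-NA constraints of condition row i.
eval_row <- function(i) {
  hits <- rep(TRUE, nrow(df))
  for (v in colnames(condmat)[-1]) {
    if (!is.na(condmat[i, v])) hits <- hits & (df[[v]] == condmat[i, v])
  }
  hits
}

# Logical matrix: one row per df row, one column per condition row.
row_hits <- sapply(seq_len(nrow(condmat)), eval_row)

# OR together the condition rows that share a name, giving one 0/1 column each.
flags <- sapply(split(seq_len(nrow(condmat)), condmat[, "cond"]), function(idx) {
  as.integer(rowSums(row_hits[, idx, drop = FALSE]) > 0)
})

result <- cbind(df, flags)
result
```

With many test cases, only `condmat` grows; the evaluation code stays the same.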
