
How to compare and combine string columns in R

I am new to R, and probably this is something stupid that everybody knows how to do, but I haven't been able to figure it out.

I created a dataframe by joining 2 dataframes, and now I have two string columns col.x and col.y, and I need to combine them into one.

The thing is that the values are not always equal, so I want to create a third column using the following criteria: (1) if the values are equal, use the value from the first column; (2) if one value is missing, use the available one from whichever column has it; (3) if they differ, insert "DIF".

I got a basic idea of comparing vectors from here - Replace values if two columns match in R - but I cannot get the code to work if I try to use the values from the first vector as the replacement values.

Example from the other question:

ind <- df$Au == df$Au_ppb
df[ind, c("Au", "Au_ppb")] <- "EQUAL"

What I am trying to do:

ind <- df$Au == df$Au_ppb
df[ind, c("Au", "Au_ppb")] <- df$Au

How would you do it? Is there an obvious solution?

Edited to add an example of data:

col.x          col.y 
company1       company1 
NA             company2 
company3       NA 
company4       company_4 
company 5 LTD  company 5

Edited to add a solution offered by a colleague:

library(dplyr)

df <- df %>%
  mutate(NewVariable = case_when(
    !is.na(col.x) ~ col.x,
    !is.na(col.y) ~ col.y,
    !is.na(col.x) & !is.na(col.y) & col.x != col.y ~ "dif"
  ))

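This works if you simply need to coalesce two string columns and ignore the NAs. Note that the third condition is never reached: case_when() uses the first condition that matches, and any row where both values are present already matches the first one, so "dif" is never assigned. The solution offered by Rémi Coulaud works for finding equal and differing rows.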
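To cover all three criteria from the question with case_when(), the conditions can be ordered so that the missing-value cases are tested first and the comparison comes afterwards. The following is a minimal sketch, assuming the col.x / col.y example data from the question and reusing the NewVariable name from the code above; it is one possible ordering, not necessarily what the colleague intended:

```r
library(dplyr)

# Example data from the question (character columns with missing values)
df <- data.frame(
  col.x = c("company1", NA, "company3", "company4", "company 5 LTD"),
  col.y = c("company1", "company2", NA, "company_4", "company 5"),
  stringsAsFactors = FALSE
)

df <- df %>%
  mutate(NewVariable = case_when(
    is.na(col.x)   ~ col.y,  # only col.y is available (stays NA if both are missing)
    is.na(col.y)   ~ col.x,  # only col.x is available
    col.x == col.y ~ col.x,  # both present and equal
    TRUE           ~ "DIF"   # both present but different
  ))

print(df)
```

Because the NA checks come first, the comparison col.x == col.y is only ever decisive for rows where both values are present.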

Here is a small dataset, inspired by the previous question, that I hope is enough to answer this one:

df <- data.frame(x = c(0.2, 0.2, 0.3, 0.4, 0.3, NA),
                 y = c(0.2, 0.4, 0.3, 0.6, NA, 0.4))
colnames(df) <- c("Au", "Au_ppb")

df :

   Au Au_ppb
1 0.2    0.2
2 0.2    0.4
3 0.3    0.3
4 0.4    0.6
5 0.3     NA
6  NA    0.4

One solution is:

# rows with at least one NA value: take the value that is available
ligne_na <- is.na(df$Au) | is.na(df$Au_ppb)
df$Newcolumn[ligne_na] <- apply(df[ligne_na, ], 1, sum, na.rm = TRUE)

# rows where both values are present but differ
df$Newcolumn[df$Au != df$Au_ppb & !ligne_na] <- "DIF"

# rows where both values are present and equal
i1 <- df$Au == df$Au_ppb & !ligne_na
df$Newcolumn[i1] <- df$Au[i1]
df :

   Au Au_ppb Newcolumn
1 0.2    0.2       0.2
2 0.2    0.4       DIF
3 0.3    0.3       0.3
4 0.4    0.6       DIF
5 0.3     NA       0.3
6  NA    0.4       0.4

You can learn more about row selection and the apply function here.

EDIT 1

The problem comes from sum: you can't sum values of character type. You could replace the first operation with this one (in the case where you have only two columns):

ligne_na <- is.na(df$Au) | is.na(df$Au_ppb)
df$Newcolumn[ligne_na] <- apply(df[ligne_na,], 1, function(x){x[!is.na(x)]})
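For character columns like those in the question, the same three steps can also be written without apply(). This is a minimal sketch, assuming the col.x / col.y example data from the question; the result column name combined is hypothetical:

```r
# Example data from the question (character columns with missing values)
df <- data.frame(
  col.x = c("company1", NA, "company3", "company4", "company 5 LTD"),
  col.y = c("company1", "company2", NA, "company_4", "company 5"),
  stringsAsFactors = FALSE
)

# TRUE only where neither value is missing
both <- !is.na(df$col.x) & !is.na(df$col.y)

# Start from col.x and fall back to col.y where col.x is missing
df$combined <- ifelse(is.na(df$col.x), df$col.y, df$col.x)

# Overwrite rows where both values are present but differ
# (FALSE & NA is FALSE, so the index contains no NA)
df$combined[both & df$col.x != df$col.y] <- "DIF"

print(df)
```

Rows where both values are missing keep NA in the new column, since there is nothing to fall back to.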

I encourage you to learn the R language through this really good reference by Emmanuel Paradis: here.

Here is a base R solution that uses ifelse():

z <- with(df, ifelse(Au == Au_ppb, "EQUAL", ifelse(Au != Au_ppb, "DIF", NA)))
df <- within(df, Compare <- replace(z, is.na(z), rowSums(df[is.na(z), -1], na.rm = TRUE)))

which gives

> df
  Sample  Au Au_ppb Compare
1   3000 0.2    0.2   EQUAL
2   3001 0.2    0.3     DIF
3   3002 0.2    0.2   EQUAL
4   3003 0.2    0.2   EQUAL
5   3004 0.3    1.0     DIF
6   3005  NA    0.3     0.3

DATA

df <- structure(list(Sample = 3000:3005, Au = c(0.2, 0.2, 0.2, 0.2, 
0.3, NA), Au_ppb = c(0.2, 0.3, 0.2, 0.2, 1, 0.3)), row.names = c(NA, -6L
), class = "data.frame")

> df
  Sample  Au Au_ppb
1   3000 0.2    0.2
2   3001 0.2    0.3
3   3002 0.2    0.2
4   3003 0.2    0.2
5   3004 0.3    1.0
6   3005  NA    0.3


 