
How to compare and combine string columns in R

I am new to R, and this is probably something basic that everybody knows how to do, but I haven't been able to figure it out.

I created a data frame by joining two data frames, and now I have two string columns, col.x and col.y, that I need to combine into one.

The values are not always equal, so I want to create a third column using the following criteria:

(1) If the values are equal, use the value from the first column.
(2) If one value is missing, use the available one from whichever column has it.
(3) If they differ, insert "DIF".

I got a basic idea of how to compare vectors from the question "Replace values if two columns match in R", but I cannot get the code to work when I try to use the values from the first vector as the replacement values.

Example from the other question:

ind <- df$Au == df$Au_ppb
df[ind, c("Au", "Au_ppb")] <- "EQUAL"

What I am trying to do:

ind <- df$Au == df$Au_ppb
df[ind, c("Au", "Au_ppb")] <- df$Au

How would you do it? Is there an obvious solution?

Edited to add an example of data:

col.x          col.y 
company1       company1 
NA             company2 
company3       NA 
company4       company_4 
company 5 LTD  company 5

Edited to add a solution offered by a colleague:

df <- df %>%
  mutate(NewVariable = case_when(
    !is.na(col.x) ~ col.x,
    !is.na(col.y) ~ col.y,
    !is.na(col.x) & !is.na(col.y) & col.x != col.y ~ "dif"
  ))

This works if you simply need to merge the two string columns and ignore the NAs. Because case_when() uses the first condition that matches, the third branch is never reached, so differing values are not flagged. The solution offered by Rémi Coulaud works for finding equal and differing rows.
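For comparison, here is a minimal sketch of the same case_when() call with the branches reordered so that the "differ" test runs first. It assumes the dplyr package is installed and uses the example data from the question; the column name NewVariable is carried over from the colleague's snippet, and "DIF" is the marker asked for in criterion (3).

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  col.x = c("company1", NA, "company3", "company4", "company 5 LTD"),
  col.y = c("company1", "company2", NA, "company_4", "company 5"),
  stringsAsFactors = FALSE
)

df <- df %>%
  mutate(NewVariable = case_when(
    # both present but different: must be tested before the fallbacks below
    !is.na(col.x) & !is.na(col.y) & col.x != col.y ~ "DIF",
    # equal values, or only col.x present: take col.x
    !is.na(col.x) ~ col.x,
    # only col.y present: take col.y
    !is.na(col.y) ~ col.y
  ))
```

Rows where both columns are NA match no branch and are left as NA.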

Here is some basic data, inspired by the previous question, that I hope is enough to answer this one:

df <- data.frame(x= c(0.2, 0.2, 0.3, 0.4, 0.3, NA),
             y = c(0.2, 0.4, 0.3, 0.6, NA, 0.4))
colnames(df) <- c("Au", "Au_ppb")

df:

   Au Au_ppb
1 0.2    0.2
2 0.2    0.4
3 0.3    0.3
4 0.4    0.6
5 0.3     NA
6  NA    0.4

One solution is this one:

# rows with at least one NA value: the row sum (ignoring NA) is the one available value
ligne_na <- is.na(df$Au) | is.na(df$Au_ppb)
df$Newcolumn[ligne_na] <- apply(df[ligne_na, ], 1, sum, na.rm = TRUE)

# rows where both values are present and differ
df$Newcolumn[df$Au != df$Au_ppb & !ligne_na] <- "DIF"

# rows where both values are present and equal
i1 <- df$Au == df$Au_ppb & !ligne_na
df$Newcolumn[i1] <- df$Au[i1]

df:

   Au Au_ppb Newcolumn
1 0.2    0.2       0.2
2 0.2    0.4       DIF
3 0.3    0.3       0.3
4 0.4    0.6       DIF
5 0.3     NA       0.3
6  NA    0.4       0.4

You can learn more about row selection and the apply function here.

EDIT 1

The problem comes from sum: you can't sum values of character type. You could replace the first operation with this one (in the case where you have only two columns):

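ligne_na <- is.na(df$Au) | is.na(df$Au_ppb)
# keep the first non-NA value of each row; [1] yields NA when both values are missing
df$Newcolumn[ligne_na] <- apply(df[ligne_na, c("Au", "Au_ppb")], 1, function(x) x[!is.na(x)][1])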
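Putting the three steps together for character columns, a minimal base R sketch might look like the following. It assumes the example data from the question (columns col.x and col.y); the helper names one_missing, same and differ are made up for illustration, and ifelse() is swapped in for apply() to pick the available value.

```r
# Example data from the question
df <- data.frame(
  col.x = c("company1", NA, "company3", "company4", "company 5 LTD"),
  col.y = c("company1", "company2", NA, "company_4", "company 5"),
  stringsAsFactors = FALSE
)

# rows with at least one missing value
one_missing <- is.na(df$col.x) | is.na(df$col.y)

# start with an all-NA character column so later assignments keep the type
df$Newcolumn <- NA_character_

# one value missing: take whichever one is available
df$Newcolumn[one_missing] <- ifelse(is.na(df$col.x), df$col.y, df$col.x)[one_missing]

# both present: equal rows take col.x, differing rows are flagged
same   <- !one_missing & df$col.x == df$col.y
differ <- !one_missing & df$col.x != df$col.y
df$Newcolumn[same]   <- df$col.x[same]
df$Newcolumn[differ] <- "DIF"
```

Because `!one_missing` is FALSE wherever a comparison would return NA, the logical indexes same and differ never contain NA.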

I encourage you to learn the R language through this really good reference by Emmanuel Paradis: here.

Here is one solution in base R, using ifelse():

z <- with(df, ifelse(Au == Au_ppb, "EQUAL", ifelse(Au != Au_ppb, "DIF", NA)))
df <- within(df, Compare <- replace(z, is.na(z), rowSums(df[is.na(z), -1], na.rm = TRUE)))

such that

> df
  Sample  Au Au_ppb Compare
1   3000 0.2    0.2   EQUAL
2   3001 0.2    0.3     DIF
3   3002 0.2    0.2   EQUAL
4   3003 0.2    0.2   EQUAL
5   3004 0.3    1.0     DIF
6   3005  NA    0.3     0.3

DATA

df <- structure(list(Sample = 3000:3005, Au = c(0.2, 0.2, 0.2, 0.2, 
0.3, NA), Au_ppb = c(0.2, 0.3, 0.2, 0.2, 1, 0.3)), row.names = c(NA, -6L
), class = "data.frame")

> df
  Sample  Au Au_ppb
1   3000 0.2    0.2
2   3001 0.2    0.3
3   3002 0.2    0.2
4   3003 0.2    0.2
5   3004 0.3    1.0
6   3005  NA    0.3
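The same nested ifelse() idea can be adapted to the criteria in the question, where equal rows should keep the value from the first column instead of the label "EQUAL". This is only a sketch under that assumption, using the question's example data and string columns col.x and col.y:

```r
# Example data from the question
df <- data.frame(
  col.x = c("company1", NA, "company3", "company4", "company 5 LTD"),
  col.y = c("company1", "company2", NA, "company_4", "company 5"),
  stringsAsFactors = FALSE
)

df$Compare <- with(df, ifelse(
  is.na(col.x),
  col.y,  # col.x missing: fall back to col.y (stays NA if both are missing)
  # col.x present: keep it when col.y is missing or equal, otherwise flag the row
  ifelse(is.na(col.y) | col.x == col.y, col.x, "DIF")
))
```

Testing is.na() before the equality comparison keeps NA out of the conditions, so no row ends up NA unless both inputs are missing.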
