R data frame regular expressions

Question

In the following example data frame:

# generate example data frame
data <- data.frame(matrix(data=c("a","b","c","d","e","f"), nrow=70, ncol=5))
data <- apply(data,1, function(x) {paste(x, collapse = " > ")})
data <- data.frame(id=1:length(data), x = data)
data$x <- as.character(data$x)

> head(data)
  id                 x
1  1 a > e > c > a > e
2  2 b > f > d > b > f
3  3 c > a > e > c > a
4  4 d > b > f > d > b
5  5 e > c > a > e > c
6  6 f > d > b > f > d

Some of the attributes in column x are known in advance, but not all of them.

The known attributes will be replaced with individual names. In the example, the set of known attributes is {"a","c","f"}.

All attributes that do not belong to this set are not known in advance and should be replaced by NA.

Step 1: Replace attributes {"a","c","f"}

# substitute all known attributes with the corresponding names
data$x <- gsub("a", "Anton", data$x)
data$x <- gsub("c", "Chris", data$x)
data$x <- gsub("f", "Flo", data$x)

The data frame now looks like this:

> head(data)
  id                                 x
1  1     Anton > e > Chris > Anton > e
2  2             b > Flo > d > b > Flo
3  3 Chris > Anton > e > Chris > Anton
4  4               d > b > Flo > d > b
5  5     e > Chris > Anton > e > Chris
6  6             Flo > d > b > Flo > d

Step 2: Replace all attributes other than {"Anton", "Chris", "Flo"} with NA

This is where I need help.

My idea is to use regular expressions and replace every value/character string that is not in {"Anton", "Chris", "Flo", ">"} with "NA".

In my real problem I don't know the values {"b","d","e"}, and the attributes can be any value or word with length greater than 1. Moreover, the values of the unknown set can change over time, so if the function is executed at a later point there can be new unknown values.

Result: The resulting data frame should look like:

> head(data)
  id                                  x
1  1    Anton > NA > Chris > Anton > NA
2  2           NA > Flo > NA > NA > Flo
3  3 Chris > Anton > NA > Chris > Anton
4  4            NA > NA > Flo > NA > NA
5  5    NA > Chris > Anton > NA > Chris
6  6           Flo > NA > NA > Flo > NA

Any help is appreciated!

Answer 1

You could try mgsub from qdap:

library(qdap)
data$x <- mgsub(c('a', 'c', 'f', 'd', 'e', 'b'),
      c('Anton', 'Chris', 'Flo', 'NA', 'NA', 'NA'), data$x)
head(data,3)
#  id                                  x
#1  1    Anton > NA > Chris > Anton > NA
#2  2           NA > Flo > NA > NA > Flo
#3  3 Chris > Anton > NA > Chris > Anton

Update

If we know only the elements to be replaced ("v1") and their replacements ("v3"), we can get the remaining elements ("v2") by removing the "v1" elements and the "punct" characters from the "x" column with gsub. That result can then be fed into mgsub:

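v1 <- c('a', 'c', 'f')
# drop the known elements and punctuation, then split what is left on whitespace
v2 <- unique(scan(text=gsub(paste(c(v1,"[[:punct:]]+"),
    collapse="|"), "", data$x), what='', quiet=TRUE))

v3 <- c('Anton', 'Chris', 'Flo')
# known elements map to their names, all remaining elements to "NA"
data$x <- mgsub(c(v1, v2), c(v3, rep("NA", length(v2))), data$x)
head(data,3)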
 #  id                                  x
 #1  1    Anton > NA > Chris > Anton > NA
 #2  2           NA > Flo > NA > NA > Flo
 #3  3 Chris > Anton > NA > Chris > Anton
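
Since the real attributes can be words longer than one character, deleting the known patterns with gsub could also remove them from inside longer unknown words. A minimal base R sketch that collects the unknown elements by splitting on the separator instead is shown here; the sample vector x and the token "cab" are made up for illustration and are not part of the question's data.

```r
# Known elements, as in the update above
v1 <- c("a", "c", "f")

# Illustrative input; "cab" is a made-up multi-character attribute
x <- c("a > e > c > a > e", "b > f > d > b > f", "cab > a")

# Split every string on the literal separator and keep each token once
tokens <- unique(unlist(strsplit(x, " > ", fixed = TRUE)))

# Everything that is not a known element is an unknown element
v2 <- setdiff(tokens, v1)
v2
```

Here "cab" is kept as a whole token rather than being reduced to "b", which is what a character-level gsub on "a|c|f" would leave behind.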

Update2

You could also do this without using any external packages:

 names(v3) <- v1
 data$x <- sapply(strsplit(data$x, ' > '), function(x)
                 paste(v3[x], collapse=" > "))
 head(data,3)
 #  id                                  x
 #1  1    Anton > NA > Chris > Anton > NA
 #2  2           NA > Flo > NA > NA > Flo
 #3  3 Chris > Anton > NA > Chris > Anton
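
The same lookup idea can be wrapped in a small function so it can be rerun whenever new unknown values show up. This is only a sketch: the function name replace_known and its arguments lookup and sep are hypothetical, and the sample input is made up. It relies on the fact that indexing a named vector with a name it does not contain yields NA, which paste then writes as "NA".

```r
# Hypothetical helper: map known tokens through a named lookup vector;
# any token without an entry in `lookup` becomes "NA".
replace_known <- function(x, lookup, sep = " > ") {
  tokens <- strsplit(x, sep, fixed = TRUE)
  vapply(tokens, function(tok) {
    out <- unname(lookup[tok])   # names not found in lookup give NA
    paste(out, collapse = sep)   # paste() renders NA as the string "NA"
  }, character(1), USE.NAMES = FALSE)
}

lookup <- c(a = "Anton", c = "Chris", f = "Flo")
x <- c("a > e > c > a > e", "foo > a > bar")   # illustrative input
res <- replace_known(x, lookup)
res
```

Because whole tokens are compared, multi-character unknown values such as "foo" are handled without having to list them in advance.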

Answer 2

This one-liner matches each word character against the names of the indicated list and replaces matches with the values associated with that name. If there is no match then NA is used as the replacement value:

library(gsubfn)
data$x <- gsubfn("\\w", list(a = "Anton", c = "Chris", f = "Flo", NA), data$x)

giving:

> head(data)
  id                                  x
1  1    Anton > NA > Chris > Anton > NA
2  2           NA > Flo > NA > NA > Flo
3  3 Chris > Anton > NA > Chris > Anton
4  4            NA > NA > Flo > NA > NA
5  5    NA > Chris > Anton > NA > Chris
6  6           Flo > NA > NA > Flo > NA
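
The Step 2 idea from the question (replace every word that is not one of the known names) can also be sketched in base R with a negative lookahead. This is not part of the gsubfn solution above. It assumes the data has already gone through Step 1, that attributes consist of word characters only, and that the known names contain no regex metacharacters; the sample vector x, including "Antonia", is made up for illustration.

```r
known <- c("Anton", "Chris", "Flo")

# Builds \b(?!(?:Anton|Chris|Flo)\b)\w+ :
# a whole word that is not exactly one of the known names
pattern <- paste0("\\b(?!(?:", paste(known, collapse = "|"), ")\\b)\\w+")

# Illustrative input in the state it has after Step 1
x <- c("Anton > e > Chris > Anton > e",
       "b > Flo > d > b > Flo",
       "Antonia > Flo")

# perl = TRUE is needed for the lookahead
res <- gsub(pattern, "NA", x, perl = TRUE)
res
```

The \b inside the lookahead keeps a longer word such as "Antonia" from being treated as the known name "Anton".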
