R dataframe regular expression

In the following example data frame:

# generate example data frame
data <- data.frame(matrix(data=c("a","b","c","d","e","f"), nrow=70, ncol=5))
data <- apply(data,1, function(x) {paste(x, collapse = " > ")})
data <- data.frame(id=1:length(data), x = data)
data$x <- as.character(data$x)

> head(data)
  id                 x
1  1 a > e > c > a > e
2  2 b > f > d > b > f
3  3 c > a > e > c > a
4  4 d > b > f > d > b
5  5 e > c > a > e > c
6  6 f > d > b > f > d

Some of the attributes in column x are known in advance, but not all of them.

The attributes which are known will be replaced with individual names. In the example the set of known attributes is {"a","c","f"}.

All attributes that do not belong to this set are not known in advance and should be replaced by NA.

Step 1: Replace attributes {"a","c","f"}

# substitute all relevant attributes with according Names
data$x <- gsub("a", "Anton",data$x)
data$x <- gsub("c", "Chris",data$x)
data$x <- gsub("f", "Flo",data$x)

The data frame now looks like this:

> head(data)
  id                                 x
1  1     Anton > e > Chris > Anton > e
2  2             b > Flo > d > b > Flo
3  3 Chris > Anton > e > Chris > Anton
4  4               d > b > Flo > d > b
5  5     e > Chris > Anton > e > Chris
6  6             Flo > d > b > Flo > d

Step 2: Replace all attributes other than {"Anton", "Chris", "Flo"} with NA

This is where I need help.

My idea is to make use of regular expressions and replace every value/character string that is not in {"Anton", "Chris", "Flo", ">"} with "NA".

In my real problem I don't know the values {"b","d","e"}, and the attributes can take on any value or word with length greater than 1. Moreover, the set of unknown values can change over time, so if the function is executed at a later point there can be new unknown values.

Result: The resulting data frame should look like:

> head(data)
  id                                  x
1  1    Anton > NA > Chris > Anton > NA
2  2           NA > Flo > NA > NA > Flo
3  3 Chris > Anton > NA > Chris > Anton
4  4            NA > NA > Flo > NA > NA
5  5    NA > Chris > Anton > NA > Chris
6  6           Flo > NA > NA > Flo > NA

Any help is appreciated!

You could try mgsub from the qdap package:

library(qdap)
data$x <- mgsub(c('a', 'c', 'f', 'd', 'e', 'b'),
      c('Anton', 'Chris', 'Flo', 'NA', 'NA', 'NA'), data$x)
head(data,3)
#  id                                  x
#1  1    Anton > NA > Chris > Anton > NA
#2  2           NA > Flo > NA > NA > Flo
#3  3 Chris > Anton > NA > Chris > Anton

Update

If only the elements to be replaced ("v1") and their replacements ("v3") are known, the remaining elements ("v2") can be recovered by stripping the elements of "v1" and the punctuation characters from column "x" with gsub. The combined vectors can then be passed to mgsub:

v1 <-  c('a', 'c', 'f')
v2 <- unique(scan(text=gsub(paste(c(v1,"[[:punct:]]+"),
    collapse="|"), "", data$x), what='', quiet=TRUE))

v3 <- c('Anton', 'Chris', 'Flo')
data$x <- mgsub(c(v1, v2), c(v3, rep("NA", length(v2))), data$x)
head(data,3)
#  id                                  x
#1  1    Anton > NA > Chris > Anton > NA
#2  2           NA > Flo > NA > NA > Flo
#3  3 Chris > Anton > NA > Chris > Anton
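
Stripping the known letters with a regular expression works for single-character attributes, but it would also remove those letters from inside longer words ("apple" would lose its "a"). As a hedged alternative, the unknown set can be derived by splitting on the separator and taking a set difference. This is only a sketch: the toy input and the name `tokens` are assumptions, not part of the original answer.

```r
# Toy input in the same "token > token" layout as the question (assumed data)
x  <- c("a > e > c > a > e", "b > f > d > b > f", "apple > a > kiwi")
v1 <- c("a", "c", "f")   # attributes known in advance

# Split on the literal separator and collect the distinct tokens
tokens <- unique(unlist(strsplit(x, " > ", fixed = TRUE)))

# Everything that is not in the known set
v2 <- setdiff(tokens, v1)
print(v2)
```

Because whole tokens are compared, a multi-character attribute such as "apple" stays intact and ends up in "v2".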

Update2

You could also do this without using any external packages:

names(v3) <- v1
data$x <- sapply(strsplit(data$x, ' > '), function(x)
                 paste(v3[x], collapse=" > "))
head(data,3)
#  id                                  x
#1  1    Anton > NA > Chris > Anton > NA
#2  2           NA > Flo > NA > NA > Flo
#3  3 Chris > Anton > NA > Chris > Anton
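
Another package-free variant stays closer to the regular-expression idea in the question: once Step 1 has been applied, every token that is not one of the known names can be replaced in a single gsub call with a negative lookahead. This is a minimal sketch, assuming that tokens consist only of word characters and that the known names contain no regex metacharacters. The variables `known`, `pattern` and `res` are illustrative.

```r
# State after Step 1 of the question (two sample rows)
known <- c("Anton", "Chris", "Flo")
x <- c("Anton > e > Chris > Anton > e",
       "b > Flo > d > b > Flo")

# \\b(?!(?:...)\\b) rejects positions where a complete known name starts;
# \\w+ then consumes any other whole token
pattern <- paste0("\\b(?!(?:", paste(known, collapse = "|"), ")\\b)\\w+")

# perl = TRUE is needed for lookahead support
res <- gsub(pattern, "NA", x, perl = TRUE)
print(res)
```

Since the pattern is built from the known names only, new unknown values that appear later need no change to the code.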

This one-liner matches each single word character ("\\w") against the names of the list and replaces a match with the value stored under that name. The unnamed NA component acts as the default, so any character without a matching name is replaced by NA:

library(gsubfn)
data$x <- gsubfn("\\w", list(a = "Anton", c = "Chris", f = "Flo", NA), data$x)

giving:

> head(data)
  id                                  x
1  1    Anton > NA > Chris > Anton > NA
2  2           NA > Flo > NA > NA > Flo
3  3 Chris > Anton > NA > Chris > Anton
4  4            NA > NA > Flo > NA > NA
5  5    NA > Chris > Anton > NA > Chris
6  6           Flo > NA > NA > Flo > NA
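
Note that "\\w" matches one character at a time, while the question mentions attributes longer than one character. Matching whole tokens with "\\w+" is the natural adjustment. The sketch below shows that idea in base R with gregexpr and regmatches rather than gsubfn, so it is an analogue and not the answer's own method. The lookup vector and the sample token "dog" are assumptions.

```r
# Known token -> replacement name (assumed lookup table)
lookup <- c(a = "Anton", c = "Chris", f = "Flo")
x <- c("a > e > c > a > e", "b > f > dog > b > f")

# Locate whole tokens instead of single characters
m <- gregexpr("\\w+", x)

# Replace each token by its lookup value; unknown tokens become the text "NA"
regmatches(x, m) <- lapply(regmatches(x, m), function(tok) {
  out <- unname(lookup[tok])   # NA for tokens without a matching name
  out[is.na(out)] <- "NA"      # write unknowns as a literal string
  out
})
print(x)
```

The separators are left untouched because only the matched tokens are rewritten.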
