Parse out a string and set it as a factor column in an R data.table

I can't find an elegant way of achieving this, please help.

I have a DT data.table:

name,value
"lorem pear ipsum",4
"apple ipsum lorem",2
"lorem ipsum plum",6

And based on a list Fruits <- c("pear", "apple", "plum") I'd like to create a factor type column.

name,value,factor
"lorem pear ipsum",4,"pear"
"apple ipsum lorem",2,"apple"
"lorem ipsum plum",6,"plum"

I guess that's basic, but I'm kinda stuck, this is how far I got:

DT[grep("apple", name, ignore.case=TRUE), factor := as.factor("apple")]

Thanks in advance.

You can vectorize this with regular expressions, e.g. by using gsub():

Set up the data:

strings <- c("lorem pear ipsum", "apple ipsum lorem", "lorem ipsum plum")
fruit <- c("pear", "apple", "plum")

Now create a regular expression:

ptn <- paste0(".*(", paste(fruit, collapse="|"), ").*")
gsub(ptn, "\\1", strings)
[1] "pear"  "apple" "plum" 

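The regular expression works by separating each search element with | and wrapping the alternatives in parentheses, which makes them capture group \\1. The .* on either side consumes the rest of the string, so replacing the whole match with \\1 leaves only the fruit: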

ptn
[1] ".*(pear|apple|plum).*"
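
One caveat worth checking: when a string contains none of the alternatives, gsub() has nothing to replace and returns the input unchanged. Below is a minimal sketch of guarding against that. The mixed-case "Apple" and the fruitless "lorem ipsum" are made-up inputs, not part of the question's data, and the names unguarded, has_fruit and extracted are only for illustration.

```r
fruit <- c("pear", "apple", "plum")
ptn <- paste0(".*(", paste(fruit, collapse = "|"), ").*")

# Hypothetical inputs: one mixed-case match and one string with no fruit at all
strings <- c("lorem pear ipsum", "Apple ipsum lorem", "lorem ipsum")

# Without a guard, the unmatched string comes back unchanged
unguarded <- gsub(ptn, "\\1", strings, ignore.case = TRUE)

# Guarded: keep the captured text only where the pattern matched, NA elsewhere.
# tolower() maps "Apple" back onto the spelling used in fruit.
has_fruit <- grepl(ptn, strings, ignore.case = TRUE)
extracted <- ifelse(has_fruit, tolower(unguarded), NA_character_)
extracted
```

Whether NA is the right placeholder for "no fruit found" depends on what the column is used for later; it is simply the value that converts cleanly to a factor.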

To do this inside a data.table, as per your question, is then as simple as:

library(data.table)
DT <- data.table(name=strings, value=c(4, 2, 6))
# Use the name column rather than the global strings vector
DT[, factor := gsub(ptn, "\\1", name)]
DT

                name value factor
1:  lorem pear ipsum     4   pear
2: apple ipsum lorem     2  apple
3:  lorem ipsum plum     6   plum
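
The question asks for a factor column, whereas gsub() returns a character vector. Here is a small base-R sketch of the conversion, with the levels taken from the fruit vector. The variable names extracted and f are assumptions for illustration.

```r
fruit <- c("pear", "apple", "plum")
extracted <- c("pear", "apple", "plum")  # stands in for the gsub() result

# Set the levels explicitly so their order follows fruit, not alphabetical sorting
f <- factor(extracted, levels = fruit)
levels(f)
is.factor(f)
```

Inside a data.table the same call can sit on the right-hand side of :=, for example DT[, factor := factor(gsub(ptn, "\\1", name), levels = fruit)].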

I don't know if there is a more "data.table" way to do it, but you can try this:

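# For each name, take the first entry of Fruits found in it (NA if none); this does not depend on row order
DT[, factor := sapply(name, function(s) Fruits[which(sapply(Fruits, grepl, x = s, ignore.case = TRUE))[1]], USE.NAMES = FALSE)]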
DT
#                 name value factor
# 1:  lorem pear ipsum     4   pear
# 2: apple ipsum lorem     2  apple
# 3:  lorem ipsum plum     6   plum

Here is my coded solution. The hard part is getting the matched string out of the regex. The best general solution I know of (one that returns whatever matched any regular expression) is the regexec() and regmatches() combination (see below).

# Create the data frame
name <- c("lorem pear ipsum", "apple ipsum lorem", "lorem ipsum plum")
value <- c(4,2,6)
DT <- data.frame(name=name, value=value, stringsAsFactors=FALSE)

# Create the regular expression
Fruits <- c("pear", "apple", "plum")
myRegEx <- paste(Fruits, collapse = "|")

# Find the matches
r <- regexec(myRegEx, DT$name, ignore.case = TRUE)
matches <- regmatches(DT$name, r)

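# Extract the first match per row (NA if nothing matched), convert to a factor
matched <- sapply(matches, function(x) if (length(x) > 0) x[[1]] else NA_character_)
fruit_factor <- factor(tolower(matched), levels = Fruits)

# Add to data frame
DT$factor <- fruit_factor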

This is probably a longer solution than you wanted.
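
Since only the first match per string is needed, regexpr() can stand in for regexec(). Here is a hedged sketch of that shorter variant, which also leaves NA where nothing matches. The mixed-case "Apple" and the fourth string are invented examples, not from the question, and matched and fruit_factor are illustrative names.

```r
Fruits <- c("pear", "apple", "plum")
name <- c("lorem pear ipsum", "Apple ipsum lorem", "lorem ipsum plum", "lorem ipsum")

# regexpr() reports the position of the first match per string, or -1 for no match
m <- regexpr(paste(Fruits, collapse = "|"), name, ignore.case = TRUE)

# regmatches() returns only the strings that matched, so fill them into an NA vector
matched <- rep(NA_character_, length(name))
matched[m != -1] <- tolower(regmatches(name, m))

fruit_factor <- factor(matched, levels = Fruits)
fruit_factor
```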
