
R: Match a character vector to text description in dataframe and return value

I have a dataframe with 2 columns and 20 rows: the title of an article and a description of that article. I have a few keywords that I would like to match against these 2 columns. If a keyword matches, it should return the value 1, else 0. I tried a simple expression like (my.df == "human") + 0. However, it does not work as expected because it only finds exact matches, so it reports no match even though the word human appears somewhere in the title. Any suggestions and help are appreciated. Thank you.

Below is an example:

my.keyword <- c("human", "lung", "mutation", "chromosome")
# sample df, created from the web

my.df 
Title                           Description
Atlas of mutations in         Lung cancer is the leading cause of cancer
                              related mortality in the United States, with
                              an estimated 221,200 new cases and 158,040
                              deaths anticipated in 2015 (ACS 2015).

It gets more complex when I want to search for all the strings in the my.keyword object without a for loop. If human, lung, mutation and chromosome all match in the title, the result should be 4. If only 3 out of 4 match, the result should be 3. The same applies to the description. However often a word is repeated, it should count only once per match. Thank you.

One way to do this is to use grepl. Here's some sample data expanding upon yours:

# Create sample data
Title <- c("Atlas of mutations in", 
           "Monkey lungs", 
           "Flatulence and the art of chromosome mutation", 
           "No keywords here")

Description = c("Lung cancer is the leading cause of cancer
                related mortality in the United States, with
                an estimated 221,200 new cases and 158,040
                deaths anticipated in 2015 (ACS 2015).",
                "That was it, the monkeys had had enough
                and began the ferocious flinging of feces 
                about the room as madness broke out and
                everyone started their chromosome mutations. The monkey
                kingdom would rise again",
                "Once upon a time there was a human that
                had trouble with R and sought out stack overflow
                for help",
                "Strange days and strange times for the human race")

my.df <- data.frame(Title = Title,
                    Description = gsub("\n", "", Description))

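Here's a method for detecting the presence of your keywords in Description:

# For one keyword x, return TRUE/FALSE per row of Description (case-insensitive)
fun <- function(x) grepl(x, my.df$Description, ignore.case = TRUE)
# One column per keyword; multiplying by 1 turns TRUE/FALSE into 1/0
keywordsDescrip <- as.data.frame(1*sapply(my.keyword, fun))
# Number of distinct keywords found in each row
keywordsDescrip$sum <- rowSums(keywordsDescrip)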

And the output:

> keywordsDescrip
  human lung mutation chromosome sum
1     0    1        0          0   1
2     0    0        1          1   2
3     1    0        0          0   1
4     1    0        0          0   1

Just repeat the above process, swapping my.df$Description for my.df$Title, to check for your keywords in that field.
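To avoid copying the code once per column, the same grepl idea can be wrapped in a small function. This is only a sketch: keyword_hits and example.df are hypothetical names introduced here, and the three example rows are shortened versions of the sample data above. Each keyword is treated as a regular expression, as in the code above.

```r
my.keyword <- c("human", "lung", "mutation", "chromosome")

# Hypothetical helper (not part of the answer above): returns one 0/1 column
# per keyword plus a "sum" column with the number of distinct keywords found.
keyword_hits <- function(text, keywords) {
  hits <- vapply(keywords,
                 function(k) grepl(k, text, ignore.case = TRUE),
                 logical(length(text)))
  # Force a rows-by-keywords integer matrix, even when text has length 1
  hits <- matrix(as.integer(hits), nrow = length(text), ncol = length(keywords),
                 dimnames = list(NULL, keywords))
  out <- as.data.frame(hits)
  out$sum <- rowSums(hits)
  out
}

# Shortened example rows, assumed for illustration only
example.df <- data.frame(
  Title = c("Atlas of mutations in", "Monkey lungs", "No keywords here"),
  Description = c("Lung cancer is the leading cause of cancer related mortality",
                  "Everyone started their chromosome mutations",
                  "Strange days and strange times for the human race"),
  stringsAsFactors = FALSE
)

title_hits   <- keyword_hits(example.df$Title, my.keyword)
descrip_hits <- keyword_hits(example.df$Description, my.keyword)
```

title_hits$sum and descrip_hits$sum then hold the per-row keyword counts for each column.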

my.keyword <- c("human", "lung", "mutation", "chromosome")
txt <- "Human lung cancer due to chromosome mutations is the leading cause of cancer related mortality in the United States, with an estimated 221,200 new cases and 158,040 deaths anticipated in 2015 (ACS 2015). "
# Lower-case the text, test each keyword as a fixed (non-regex) substring, and count the hits
count.kw <- function(txt) sum(sapply(my.keyword, grepl, x=tolower(txt), fixed=TRUE))
count.kw(txt)
# [1] 4

Notice how I "edited" your text to include more than one of the keywords.

This works for a single string, but not for a vector of strings, because sum() adds up the matches across all of the strings. So we have to vectorize the function:

vcount.lw <- Vectorize(count.kw)
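A quick way to see what the vectorization changes is to call both versions on two strings. The input two.strings below is made up for illustration; everything else repeats the definitions above.

```r
my.keyword <- c("human", "lung", "mutation", "chromosome")
count.kw <- function(txt) sum(sapply(my.keyword, grepl, x = tolower(txt), fixed = TRUE))

two.strings <- c("human lung", "no match")   # made-up input for illustration

# Unvectorized: sum() collapses the whole string-by-keyword matrix into one total
pooled <- count.kw(two.strings)

# Vectorized: count.kw() is called once per string, giving one count per string
# (unname() drops the names that Vectorize() attaches for character input)
per.string <- unname(Vectorize(count.kw)(two.strings))
```

Here pooled is a single number covering both strings, while per.string has one entry per string.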

Then, create an example:

set.seed(1)
rwords <- function(x) paste(paste(my.keyword[sample(1:4,sample(1:4))], collapse= " "),"blah, blah, blah")
df <- data.frame(Title=sapply(1:10,rwords))

and demonstrate the solution.

vcount.lw(df$Title)
#  [1] 2 4 4 3 2 3 4 2 2 2
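As a possible alternative to Vectorize, the logical matrix that sapply builds can be summed row by row. This is a sketch on made-up titles (the titles vector is an assumption, not the random data above), and it assumes there are at least two strings, since sapply returns a plain vector rather than a matrix for a single string.

```r
my.keyword <- c("human", "lung", "mutation", "chromosome")

# Made-up titles for illustration
titles <- c("human lung blah, blah, blah",
            "chromosome blah, blah, blah",
            "Human MUTATION lung chromosome blah")

# One row per title, one column per keyword; TRUE where the keyword occurs
hits <- sapply(my.keyword, grepl, x = tolower(titles), fixed = TRUE)

# Number of distinct keywords found in each title
counts <- rowSums(hits)
```

Because each keyword contributes at most one TRUE per row, a repeated word still counts only once.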
