
Is there a function in the R tidyverse to categorize character values of a column based on keywords and assign a category?

For example:

Data frame 1 has:

Keyword <- c("dog", "cat", "tiger", "cheetah", "man")
Category <- c("walk", "house", "jungle", "fast", "office")

and data frame 2 has a description column, for example:

description <- c("dog is barking", "cat is purring", "tiger is hunting",
"cheetah is running", "man is working")

I want to write a function that searches the description column of data frame 2 for the keywords in data frame 1 and returns the matching category for each row. How do I do this using the tidyverse? Thanks!

This may be helpful to you. It takes the first word of each description as the keyword and joins it against the lookup table:

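library(dplyr)

# df is the keyword/category lookup, df2 holds the descriptions (see Data below)
df2 %>%
  rowwise() %>%
  # take the first whitespace-separated word of each description
  mutate(keyword = first(unlist(strsplit(des, "\\s+", perl = TRUE)))) %>%
  # attach the category whose Keyword equals that word
  left_join(df, by = c("keyword" = "Keyword"))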

# A tibble: 5 x 3
# Rowwise: 
  des                keyword Category
  <chr>              <chr>   <chr>   
1 dog is barking     dog     walk    
2 cat is purring     cat     house   
3 tiger is hunting   tiger   jungle  
4 cheetah is running cheetah fast    
5 man is working     man     office  
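
Note that the code above only looks at the first word of each description. If the keyword can appear anywhere in the text, a possible variant is sketched below. It is not part of the original answer: it assumes the stringr package is available and that the keywords contain no regex metacharacters, and the two altered descriptions ("the cat is purring", "nobody is here") are hypothetical test values.

```r
library(dplyr)
library(stringr)

# Lookup table as in the Data section; descriptions slightly altered (hypothetical)
df <- data.frame(
  Category = c("walk", "house", "jungle", "fast", "office"),
  Keyword  = c("dog", "cat", "tiger", "cheetah", "man"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  des = c("dog is barking", "the cat is purring", "tiger is hunting",
          "cheetah is running", "nobody is here"),
  stringsAsFactors = FALSE
)

# One alternation pattern with word boundaries, e.g. "\\b(dog|cat|...)\\b"
pattern <- paste0("\\b(", paste(df$Keyword, collapse = "|"), ")\\b")

result <- df2 %>%
  # leftmost keyword found in the description, NA if none occurs
  mutate(keyword = str_extract(des, pattern)) %>%
  # rows without a keyword keep NA as their Category
  left_join(df, by = c("keyword" = "Keyword"))

result
```

Because str_extract() is vectorized, no rowwise() is needed here. A description containing several keywords gets the category of the one that appears first in the text.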

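Or we can use the match function instead of left_join. A keyword without a match yields NA: match coerces its nomatch argument to integer, so nomatch = NA_character_ behaves the same as the default, and indexing Category with NA returns NA. I prefer this solution: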

df2 %>%
  rowwise() %>%
  mutate(keyword = first(unlist(strsplit(des, "\\s+", perl = TRUE))), 
         cat = df$Category[match(keyword, df$Keyword, nomatch = NA_character_)])

# A tibble: 5 x 3
# Rowwise: 
  des                keyword cat   
  <chr>              <chr>   <chr> 
1 dog is barking     dog     walk  
2 cat is purring     cat     house 
3 tiger is hunting   tiger   jungle
4 cheetah is running cheetah fast  
5 man is working     man     office
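
To see how the match-based lookup treats a word that is not in the keyword list, here is a small base R sketch. The vector names and the value "lion" are hypothetical and chosen only for illustration.

```r
keywords   <- c("dog", "cat", "tiger", "cheetah", "man")
categories <- c("walk", "house", "jungle", "fast", "office")

# "lion" does not occur in the lookup
first_words <- c("dog", "lion", "man")

# Positions of each word in the keyword vector; NA where there is no match
idx_default  <- match(first_words, keywords)
idx_explicit <- match(first_words, keywords, nomatch = NA_character_)

# nomatch is coerced to integer, so both calls should agree
same_index <- identical(idx_default, idx_explicit)

# Indexing with NA gives NA, so unmatched words get no category
looked_up <- categories[idx_default]
looked_up
```

Here looked_up holds "walk", NA and "office", so unmatched rows can be filtered afterwards with is.na().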

Data

> dput(df2)
structure(list(des = c("dog is barking", "cat is purring", "tiger is hunting", 
"cheetah is running", "man is working")), row.names = c(NA, -5L
), class = "data.frame")

> dput(df)
structure(list(Category = c("walk", "house", "jungle", "fast", 
"office"), Keyword = c("dog", "cat", "tiger", "cheetah", "man"
)), class = "data.frame", row.names = c(NA, -5L))
