pattern matching R

ca.df

id    Category
1     Noun
2     Negative
3     Positive
4     adj
5     word

Each term can belong to more than one category, so it corresponds to more than one id. In terms.df, all of a term's ids are stored in a single column, separated by spaces.

terms.df

Terms   id
 Love    1 4 5 3
 Hate    2 4 5
 ice     1 5

Each id in terms.df corresponds to a Category in ca.df. I want output like this:

x.df

Category      terms

Noun          ice Love
Negative      Hate
Positive      Love
adj           Hate Love
word          ice Hate Love

How can I do this?

Here's a possible solution using the data.table and splitstackshape packages:

library(splitstackshape) ## loads `data.table` package too
terms.df <- cSplit(terms.df, "id", sep = " ", direction = "long")
setkey(terms.df, id)[ca.df, .(Category , Terms = toString(Terms)), by = .EACHI]

#    id Category           Terms
# 1:  1     Noun       Love, ice
# 2:  2 Negative            Hate
# 3:  3 Positive            Love
# 4:  4      adj      Love, Hate
# 5:  5     word Love, Hate, ice

Some explanations

  1. We first split the id column on spaces into long format, so that each Terms/id pair gets its own row
  2. Then we perform a binary join between the two data sets on the id column, keeping every row of ca.df
  3. While joining, we concatenate the Terms values back together for each id using by = .EACHI, which evaluates the j expression once per row of ca.df and so lets us aggregate during the join
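
The three steps can also be sketched with data.table alone, which may make each one easier to see. This is a minimal sketch, not the answer's original code: it assumes the question's data (with Love carrying ids 1 4 5 3), replaces cSplit with strsplit, and uses an ad hoc on = join instead of setkey. The names ca.dt, terms.dt, long.dt and res are made up for illustration.

```r
library(data.table)

# Assumed sample data, copied from the question
ca.dt <- data.table(id = 1:5,
                    Category = c("Noun", "Negative", "Positive", "adj", "word"))
terms.dt <- data.table(Terms = c("Love", "Hate", "ice"),
                       id = c("1 4 5 3", "2 4 5", "1 5"))

# Step 1: split the id strings on spaces, one row per Terms/id pair
long.dt <- terms.dt[, .(id = as.integer(unlist(strsplit(id, " ", fixed = TRUE)))),
                    by = Terms]

# Steps 2 and 3: join on id and, for each row of ca.dt (by = .EACHI),
# collapse the matching Terms into one comma-separated string
res <- long.dt[ca.dt, on = "id",
               .(Category, Terms = toString(Terms)),
               by = .EACHI]
res
```

Converting the split ids to integer matters here, because the join needs the id columns of both tables to have compatible types.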

A solution using tidyr and dplyr:

library(tidyr)
library(dplyr)
ca.df$id <- as.character(ca.df$id)

terms.df %>% separate(id,into=paste0("V",1:3),sep = " ",extra = "merge") %>%
  gather(var,id,-Terms) %>%
  filter(!is.na(id)) %>%
  left_join(ca.df,by="id") %>%
  select(-var,-id) %>%
  group_by(Category) %>%
  summarize(Terms=paste(Terms,collapse=" "))

Output (the sample data below gives Love only the ids 1, 4 and 5, so Positive does not appear):

Source: local data frame [4 x 2]

      Category         Terms
    1 Negative          Hate
    2     Noun      Love ice
    3      adj     Love Hate
    4     word ice Love Hate

Data:

ca.df <- read.table(text = 
"id    Category
1     Noun
2     Negative
3     Positive
4     adj
5     word", header = TRUE, stringsAsFactors = FALSE)

terms.df <- read.table(text = 
"Terms   id
Love    '1 4 5'
Hate    '2 4 5'
ice     '1 5'
", header = TRUE, stringsAsFactors = FALSE)
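
Note that into = paste0("V", 1:3) in the pipeline above assumes at most three ids per term; with the question's original row (Love with ids 1 4 5 3), extra = "merge" would leave "5 3" in the last column and the join would not match it. A possible variant that does not depend on the number of ids uses tidyr's separate_rows. This is only a sketch, assuming the question's data rather than the three-id sample above:

```r
library(tidyr)
library(dplyr)

# Assumed sample data, copied from the question (Love has four ids)
ca.df <- data.frame(id = 1:5,
                    Category = c("Noun", "Negative", "Positive", "adj", "word"),
                    stringsAsFactors = FALSE)
terms.df <- data.frame(Terms = c("Love", "Hate", "ice"),
                       id = c("1 4 5 3", "2 4 5", "1 5"),
                       stringsAsFactors = FALSE)

x.df <- terms.df %>%
  # one row per Terms/id pair; convert = TRUE turns the split ids into integers
  separate_rows(id, sep = " ", convert = TRUE) %>%
  left_join(ca.df, by = "id") %>%
  group_by(Category) %>%
  # collapse the terms of each category into one space-separated string
  summarise(Terms = paste(Terms, collapse = " "))
x.df
```

Because the ids are converted to integers during the split, ca.df$id does not need to be cast to character first.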

You can use merge to combine the two data frames on id, once terms.df is in long format (one row per term/id pair):

ca.df <- data.frame(id=1:5, Category=c("Noun", "Negative", "Positive", "adj", "word"))
terms.df <- data.frame(Terms=c(rep("Love", 3), rep("Hate", 3), rep("ice", 2)), 
        id = c(1,4,5,2,4,5,1,5))
x.df <- merge(ca.df, terms.df, by="id")
x.df

  id Category Terms
1  1     Noun  Love
2  1     Noun   ice
3  2 Negative  Hate
4  4      adj  Love
5  4      adj  Hate
6  5     word  Love
7  5     word  Hate
8  5     word   ice
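
The merged result above still has one row per term/category pair, and it starts from a hand-built long terms.df. To get from the question's space-separated id column to the requested one-row-per-category layout in base R, something like the following might work. It is a sketch under the assumption that ids are always separated by single spaces; id.list, long.df and merged.df are illustrative names.

```r
# Assumed sample data, copied from the question
ca.df <- data.frame(id = 1:5,
                    Category = c("Noun", "Negative", "Positive", "adj", "word"),
                    stringsAsFactors = FALSE)
terms.df <- data.frame(Terms = c("Love", "Hate", "ice"),
                       id = c("1 4 5 3", "2 4 5", "1 5"),
                       stringsAsFactors = FALSE)

# Split each id string into its individual ids
id.list <- strsplit(terms.df$id, " ", fixed = TRUE)

# Long format: repeat each term once per id it carries
long.df <- data.frame(Terms = rep(terms.df$Terms, lengths(id.list)),
                      id = as.integer(unlist(id.list)),
                      stringsAsFactors = FALSE)

# Attach the category names
merged.df <- merge(ca.df, long.df, by = "id")

# Collapse the terms of each category into one space-separated string
x.df <- aggregate(Terms ~ Category, data = merged.df, FUN = paste, collapse = " ")
x.df
```

The order of terms within a category follows the row order that merge returns, so sort merged.df first if a particular order is needed.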
