
Count the frequency of words after a specific word

I have many tweets stored as text.

I would like to know how often each word appears directly after a specific word. For instance, given these tweets, I want the frequency of each word that follows "love":

My love is... 
My love is...
the love was...
the love were...

to get this result:

word    next word  frequency

love    is         2
love    was        1
love    were       1

or, for all word pairs:

word    next word  frequency

My      love       2
the     love       2
love    is         2
love    was        1
love    were       1

The following procedure might help.

Step 1 (optional): Create some example data

example <- c("my love is","my love is","banana","apple","the love was","the love were")

This vector looks like

"my love is"    "my love is"    "banana"        "apple"         "the love was"  "the love were"

Step 2: Keep only the entries of the vector that contain the word "love"

ex2 <- example[grep("love",example)]

which gives you

"my love is"    "my love is"    "the love was"  "the love were"
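Note that grep("love", ...) matches "love" as a substring, so entries containing words such as "glove" or "lovely" would be kept too. A minimal sketch of a stricter filter, assuming you only want the standalone word; the vector example2 and the names loose and strict are made up for illustration:

```r
# Hypothetical data: "glove" and "lovely" contain "love" but are not the word itself
example2 <- c("my love is", "a glove is", "lovely day", "the love was")

# Plain substring match keeps all four entries
loose <- example2[grep("love", example2)]

# \\b marks a word boundary, so only the standalone word "love" matches
# (perl = TRUE is used here so the pattern is read as a Perl-style regex)
strict <- example2[grep("\\blove\\b", example2, perl = TRUE)]
strict
```

Adding ignore.case = TRUE to the grep call would also catch "Love" at the start of a tweet, if that matters for your data.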

Step 3: Build a table of the words that come after the word "love"

ex3 <- table(gsub(".*love","",ex2))

which gives you

   is   was  were 
    2     1     1 
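The gsub call in Step 3 keeps everything after "love", including the leading space, so a longer tweet such as "my love is blind" would be counted as " is blind" rather than "is". A possible variant that captures only the single next word is sketched below. It assumes the vector has already been filtered as in Step 2 and that "love" is always followed by at least one more word; the name next_word and the extra example entry are illustrative only.

```r
# Filtered vector from Step 2, plus one longer (hypothetical) entry
ex2 <- c("my love is", "my love is", "the love was", "the love were", "my love is blind")

# Capture only the first run of non-space characters after the word "love"
# (perl = TRUE so that \\b, \\s and \\S are read as Perl-style classes)
next_word <- sub(".*\\blove\\s+(\\S+).*", "\\1", ex2, perl = TRUE)

# Frequency table of the words that follow "love"
freq <- table(next_word)
freq
```

With real tweets such as "My love is...", the captured token would still carry the trailing punctuation ("is..."), so you may want to strip it first, for example with gsub("[[:punct:]]", "", ex2).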

Because you are dealing with several word combinations (first word x second word), I don't see a way to avoid a loop. The code below should do what you want:

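phrase <- c("My love is... ", "My love is...", "A love was...", "the dogs were...")
# Split each phrase on spaces; this assumes every phrase has the same number of words
SPLIT <- matrix(unlist(strsplit(phrase, " ")), nrow = length(phrase), byrow = TRUE)
# All unique (first word, second word) combinations, with an empty frequency column
vect <- as.data.frame(cbind(unique(expand.grid(SPLIT[, 1], SPLIT[, 2])), freq = NA))
to.find <- paste(vect[, 1], vect[, 2], sep = " ")
# Count the phrases that contain each combination
for (i in seq_along(to.find)) {
  vect[i, 3] <- length(grep(to.find[i], phrase, fixed = TRUE))
}
# Keep only the combinations that actually occur
vect <- subset(vect, freq > 0)
vect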

which gives you

   Var1 Var2 freq
1    My love    2
3     A love    1
16  the dogs    1
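The code above only pairs the first word with the second one. If you want every adjacent pair, as in the second table of the question, one alternative is to pair each token with its successor and let table() do the counting. This is only a sketch under some assumptions: words are separated by whitespace, punctuation can be dropped, and case should be ignored. The names tokens, pairs and counts are made up for illustration.

```r
phrase <- c("My love is...", "My love is...", "the love was...", "the love were...")

# Lower-case, strip punctuation, and split each phrase into tokens
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(phrase)), "[[:space:]]+")

# For each phrase, pair every token with the token that follows it
pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)  # a single word has no successor
  data.frame(word = head(w, -1), next_word = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Cross-tabulate the pairs and drop combinations that never occur
counts <- as.data.frame(table(pairs), responseName = "frequency",
                        stringsAsFactors = FALSE)
counts <- counts[counts$frequency > 0, ]
counts
```

To restrict the result to a single word, subset it afterwards, for example counts[counts$word == "love", ].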

