
Extract the words that differ between two sentences

I have a very large data frame with two columns called sentence1 and sentence2. I am trying to make a new column with the words that differ between the two sentences, for example:

sentence1=c("This is sentence one", "This is sentence two", "This is sentence three")
sentence2=c("This is the sentence four", "This is the sentence five", "This is the sentence six")
df = as.data.frame(cbind(sentence1,sentence2))

My data frame has the following structure:

ID    sentence1                    sentence2
 1     This is sentence one         This is the sentence four
 2     This is sentence two         This is the sentence five
 3     This is sentence three       This is the sentence six

And my expected result is:

ID    sentence1        sentence2     Expected_Result
 1     This is ...      This is ...   one the four 
 2     This is ...      This is ...   two the five
 3     This is ...      This is ...   three the six

In R I was trying to split the sentences and then get the elements that differ between the resulting lists, something like:

df$split_Sentence1<-strsplit(df$sentence1, split=" ")
df$split_Sentence2<-strsplit(df$sentence2, split=" ")
df$Dif<-setdiff(df$split_Sentence1, df$split_Sentence2)

But this approach does not work when applying setdiff, because setdiff compares the two list columns as wholes (one whole split sentence per element) rather than word by word within each row...

In Python I was trying to use NLTK to get the tokens first and then extract the difference between the two lists, something like:

from nltk.tokenize import word_tokenize

df['tokensS1'] = df.sentence1.apply(lambda x:  word_tokenize(x))
df['tokensS2'] = df.sentence2.apply(lambda x:  word_tokenize(x))

And at this point I cannot find a function which gives me the result I need.

I hope you can help me. Thanks

Here's an R solution.

I've created an exclusiveWords function that finds the words appearing in only one of the two sentences (the symmetric difference of the two word sets) and returns a 'sentence' made up of those words. I've wrapped it in Vectorize() so that it works on all rows of the data.frame at once.

df = as.data.frame(cbind(sentence1,sentence2), stringsAsFactors = F)

exclusiveWords <- function(x, y){
    x <- strsplit(x, " ")[[1]]
    y <- strsplit(y, " ")[[1]]
    u <- union(x, y)
    # words of the union missing from one of the two sentences,
    # i.e. the symmetric difference of the two word sets
    u <- union(setdiff(u, x), setdiff(u, y))
    return(paste0(u, collapse = " "))
}

exclusiveWords <- Vectorize(exclusiveWords)

df$result <- exclusiveWords(df$sentence1, df$sentence2)
df
#                sentence1                 sentence2        result
# 1   This is sentence one This is the sentence four  the four one
# 2   This is sentence two This is the sentence five  the five two
# 3 This is sentence three  This is the sentence six the six three

Essentially the same as the answer from @SymbolixAU, written as an apply() over the rows. It assumes the split_Sentence1 and split_Sentence2 list columns from the question have already been created with strsplit().

# symmetric difference: words in the union that are not in the intersection
df$Dif  <-  apply(df, 1, function(r) {
  paste(setdiff(union    (unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']])),
                intersect(unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']]))), 
        collapse = " ")
})
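
A quick check of this version (a minimal sketch: it assumes character, not factor, columns, as in the answer above, and re-creates the split_Sentence1 and split_Sentence2 list columns from the question before running the apply() call):

# rebuild the inputs with character columns so strsplit() works,
# then create the list columns the apply() call above relies on
df <- data.frame(sentence1, sentence2, stringsAsFactors = FALSE)
df$split_Sentence1 <- strsplit(df$sentence1, split = " ")
df$split_Sentence2 <- strsplit(df$sentence2, split = " ")

# ... after running the apply() call above:
df$Dif
# [1] "one the four"  "two the five"  "three the six"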

In Python, you can treat the words of each sentence as a set and compute the set-theoretic exclusive 'or' (symmetric difference): the set of words that are in one sentence but not in the other:

df.apply(lambda x:  
            set(word_tokenize(x['sentence1'])) \
          ^ set(word_tokenize(x['sentence2'])), axis=1)

The result is a Series of sets.

#0     {one, the, four}
#1     {the, two, five}
#2    {the, three, six}
#dtype: object
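
To get the Expected_Result column from the question, each set can be joined back into a space-separated string. A minimal sketch, assuming the same df and word_tokenize as above (note that the word order inside a set is arbitrary):

# join each symmetric-difference set into a space-separated string;
# the order of words within a set is not guaranteed
df['Expected_Result'] = df.apply(
    lambda x: ' '.join(set(word_tokenize(x['sentence1']))
                       ^ set(word_tokenize(x['sentence2']))),
    axis=1)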
