I have a very large data frame with two columns called sentence1 and sentence2. I am trying to make a new column containing the words that differ between the two sentences, for example:
sentence1=c("This is sentence one", "This is sentence two", "This is sentence three")
sentence2=c("This is the sentence four", "This is the sentence five", "This is the sentence six")
df = as.data.frame(cbind(sentence1,sentence2))
My data frame has the following structure:
ID  sentence1               sentence2
1   This is sentence one    This is the sentence four
2   This is sentence two    This is the sentence five
3   This is sentence three  This is the sentence six
And my expected result is:
ID  sentence1    sentence2    Expected_Result
1   This is ...  This is ...  one the four
2   This is ...  This is ...  two the five
3   This is ...  This is ...  three the six
In R, I was trying to split the sentences and then get the elements that differ between the two lists, something like:
df$split_Sentence1<-strsplit(df$sentence1, split=" ")
df$split_Sentence2<-strsplit(df$sentence2, split=" ")
df$Dif<-setdiff(df$split_Sentence1, df$split_Sentence2)
But this approach does not work when applying setdiff...
In Python, I was trying to use NLTK to get the tokens first and then extract the difference between the two lists, something like:
from nltk.tokenize import word_tokenize
df['tokensS1'] = df.sentence1.apply(lambda x: word_tokenize(x))
df['tokensS2'] = df.sentence2.apply(lambda x: word_tokenize(x))
At this point I cannot find a function that gives me the result I need.
I hope you can help me. Thanks
Here's an R solution.
I've created an exclusiveWords function that finds the words that appear in only one of the two sentences and returns a 'sentence' made up of those words. I've wrapped it in Vectorize() so that it works on all rows of the data.frame at once.
df = as.data.frame(cbind(sentence1, sentence2), stringsAsFactors = FALSE)
exclusiveWords <- function(x, y){
  # split each sentence into its words
  x <- strsplit(x, " ")[[1]]
  y <- strsplit(y, " ")[[1]]
  u <- union(x, y)
  # keep the words missing from x, followed by the words missing from y
  u <- union(setdiff(u, x), setdiff(u, y))
  return(paste0(u, collapse = " "))
}
exclusiveWords <- Vectorize(exclusiveWords)
df$result <- exclusiveWords(df$sentence1, df$sentence2)
df
# sentence1 sentence2 result
# 1 This is sentence one This is the sentence four the four one
# 2 This is sentence two This is the sentence five the five two
# 3 This is sentence three This is the sentence six the six three
This is essentially the same as the answer from @SymbolixAU, written with apply. It uses the split_Sentence1 and split_Sentence2 columns created in the question:
df$Dif <- apply(df, 1, function(r) {
paste(setdiff(union (unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']])),
intersect(unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']]))),
collapse = " ")
})
In Python, you can build a function that treats the words in each sentence as a set and computes the symmetric difference, i.e. the set-theoretic exclusive 'or' (the set of words that are in one sentence but not in the other):
from nltk.tokenize import word_tokenize
# ^ is the symmetric-difference operator for Python sets
df.apply(lambda x:
         set(word_tokenize(x['sentence1']))
         ^ set(word_tokenize(x['sentence2'])), axis=1)
The result is a Series of sets (sets are unordered, so the order of the words within each set may vary).
#0 {one, the, four}
#1 {the, two, five}
#2 {the, three, six}
#dtype: object
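Since sets are unordered, the output above does not reproduce the word order shown in the question's Expected_Result column. If that order matters, one possible approach is sketched below. It is only an illustration, not part of the original answers: the helper name differing_words is hypothetical, and it splits on whitespace with str.split instead of using NLTK's word_tokenize, so punctuation is not handled.

```python
def differing_words(s1, s2):
    """Return the words found in only one of the two sentences.

    Words from s1 come first, then words from s2, each in their
    original order (hypothetical helper, whitespace tokenization).
    """
    words1, words2 = s1.split(), s2.split()
    set1, set2 = set(words1), set(words2)
    only_in_1 = [w for w in words1 if w not in set2]
    only_in_2 = [w for w in words2 if w not in set1]
    # dict.fromkeys drops repeated words while keeping first-seen order
    return " ".join(dict.fromkeys(only_in_1 + only_in_2))


sentence1 = ["This is sentence one", "This is sentence two",
             "This is sentence three"]
sentence2 = ["This is the sentence four", "This is the sentence five",
             "This is the sentence six"]

results = [differing_words(a, b) for a, b in zip(sentence1, sentence2)]
print(results)
# ['one the four', 'two the five', 'three the six']
```

On a pandas data frame, the same helper could be applied row by row, for example with df.apply(lambda r: differing_words(r['sentence1'], r['sentence2']), axis=1).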