How do I count and score multiple lists of words against a corpus of multiple documents, so that I can sort the documents in a few different ways?
sort by most red:
'i ate a red apple.'
'the kid read the book the little red riding hood'

most similar to doc 0:
'i ate a red apple.'
'the kid read the book the little red riding hood'
For example:
colors = ['red', 'blue', 'yellow', 'purple']
things = ['apple', 'pickle', 'tomato', 'rainbow', 'book']
corpus = [
    'i ate a red apple.',
    'There are so many colors in the rainbow.',
    'the monster was purple and green.',
    'the pickle is very green',
    'the kid read the book the little red riding hood',
    'in the book the wizard of oz there was a yellow brick road.',
    'tom has a green thumb and likes working in a garden.',
]
Do I make a counter for each document (indexed 0 to 6)?
# 0 'i ate a red apple.'
{'red': 1, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 1, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}
# 1 'There are so many colors in the rainbow.'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 1, 'book': 0}
# 2 'the monster was purple and green.'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 1}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}
# 3 'the pickle is very green'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 1, 'tomato': 0, 'rainbow': 0, 'book': 0}
# 4 'the kid read the book the little red riding hood'
{'red': 1, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 1}
# 5 'in the book the wizard of oz there was a yellow brick road.'
{'red': 0, 'blue': 0, 'yellow': 1, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 1}
# 6 'tom has a green thumb and likes working in a garden.'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}
Or do I make one array for colors and one for things?
# colors
        0  1  2  3  4  5  6
red     1  0  0  0  1  0  0
blue    0  0  0  0  0  0  0
yellow  0  0  0  0  0  1  0
purple  0  0  1  0  0  0  0
# things
         0  1  2  3  4  5  6
apple    1  0  0  0  0  0  0
pickle   0  0  0  1  0  0  0
tomato   0  0  0  0  0  0  0
rainbow  0  1  0  0  0  0  0
book     0  0  0  0  1  1  0
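For reference, the per-document counter idea above can be sketched with just the standard library (a minimal sketch; the array form is the same data laid out row-per-keyword):

```python
from collections import Counter

corpus = ['i ate a red apple.', 'There are so many colors in the rainbow.',
          'the monster was purple and green.', 'the pickle is very green',
          'the kid read the book the little red riding hood',
          'in the book the wizard of oz there was a yellow brick road.',
          'tom has a green thumb and likes working in a garden.']
colors = ['red', 'blue', 'yellow', 'purple']

# one counter per document, restricted to the color keywords
color_counts = []
for line in corpus:
    words = Counter(line.strip(' !?.').lower().split())
    color_counts.append({c: words.get(c, 0) for c in colors})

print(color_counts[0])  # {'red': 1, 'blue': 0, 'yellow': 0, 'purple': 0}

# the "array" form: one row of counts per keyword, one column per document
red_row = [color_counts[i]['red'] for i in range(len(corpus))]
print(red_row)  # [1, 0, 0, 0, 1, 0, 0]
```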
Then find the most similar documents, or sort by the closest counts:

sort by most red:
'i ate a red apple.'
'the kid read the book the little red riding hood'

most similar to doc 0:
'i ate a red apple.'
'the kid read the book the little red riding hood'
Or should I use doc2vec or something completely different?
You can achieve this by iterating over each line and grouping by word to get the counts:
import pandas as pd

def words_counter(corpus_parameter, colors_par, things_par):
    """Return two dataframes with the occurrences of the words in colors_par & things_par.

    corpus_parameter: list of strings, common language
    colors_par: list of words with no spaces or punctuation
    things_par: list of words with no spaces or punctuation
    """
    colors_count, things_count = [], []  # lists to collect intermediate series
    for i, line in enumerate(corpus_parameter):  # iterate over the parameter, not a global
        words = pd.Series(
            line
            .strip(' !?.')  # remove spaces or punctuation from the ends of the string
            .lower()        # count 'red', 'Red', and 'RED' as the same word
            .split()        # split on spaces by default; a different separator can be passed
        )  # returns a clean series with all the words
        # print(words)  # uncomment to see the series
        words = words.groupby(words).size()  # words as index, counts as values
        # print(words)  # uncomment to see the series
        colors_count.append(words.loc[words.index.isin(colors_par)])
        things_count.append(words.loc[words.index.isin(things_par)])
    colors_count = (
        pd.concat(colors_count, axis=1)  # convert list of series to dataframe
        .reindex(colors_par)             # include colors with zero occurrences
        .fillna(0)                       # get rid of NaNs
        .astype(int)                     # convert from default float to integer
    )
    things_count = pd.concat(things_count, axis=1).reindex(things_par).fillna(0).astype(int)
    print(colors_count)
    print(things_count)
    return colors_count, things_count
Call it with the line
words_counter(corpus, colors, things)
Output
0 1 2 3 4 5 6
red 1 0 0 0 1 0 0
blue 0 0 0 0 0 0 0
yellow 0 0 0 0 0 1 0
purple 0 0 1 0 0 0 0
0 1 2 3 4 5 6
apple 1 0 0 0 0 0 0
pickle 0 0 0 1 0 0 0
tomato 0 0 0 0 0 0 0
rainbow 0 1 0 0 0 0 0
book 0 0 0 0 1 1 0
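With these two dataframes, the "sort by most red" query from the question is a sort on the red row. A minimal sketch, rebuilding the counts inline so it runs standalone (the `most_red` name is mine, not from the answer):

```python
import pandas as pd

corpus = ['i ate a red apple.', 'There are so many colors in the rainbow.',
          'the monster was purple and green.', 'the pickle is very green',
          'the kid read the book the little red riding hood',
          'in the book the wizard of oz there was a yellow brick road.',
          'tom has a green thumb and likes working in a garden.']
colors = ['red', 'blue', 'yellow', 'purple']

# same result as words_counter's colors_count: keywords as rows, documents as columns
counts = {i: pd.Series(line.strip(' !?.').lower().split()).value_counts()
          for i, line in enumerate(corpus)}
colors_count = pd.DataFrame(counts).reindex(colors).fillna(0).astype(int)

# "sort by most red": rank documents by their 'red' count, highest first
red_rank = colors_count.loc['red'].sort_values(ascending=False)
most_red = [corpus[i] for i, c in red_rank.items() if c > 0]
print(most_red)
```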
IIUC, you have a bunch of topics such as colors, things, moods, etc., and each topic has some keywords. You want to find the similarity between sentences based on the occurrence of keywords from one topic at a time.
You can do this in two steps:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
out = cv.fit_transform(corpus).toarray()  # apply CountVectorizer

# For scalability (you can have a lot more topics, like moods), combine all
# topics first and filter by the given topic later.
combined = colors + things  # combine all your topics
c = [(k, v) for k, v in cv.vocabulary_.items() if k in combined]  # indexes for the items from all topics
cdf = pd.DataFrame(out[:, [i[1] for i in c]], columns=[i[0] for i in c]).T  # filter the matrix for those items
print(cdf)
# This results in a keyword-occurrence dataset with all keywords from all topics
0 1 2 3 4 5 6
red 1 0 0 0 1 0 0
apple 1 0 0 0 0 0 0
rainbow 0 1 0 0 0 0 0
purple 0 0 1 0 0 0 0
pickle 0 0 0 1 0 0 0
book 0 0 0 0 1 1 0
yellow 0 0 0 0 0 1 0
Now, for the next step, filter this by topic (colors, or things, etc.) and take the similarity of that matrix as a dot product (for a true cosine similarity, you would also normalize each column). This can be done with this function:
def get_similarity_table(topic):
    df = cdf.loc[cdf.index.isin(topic)]  # filter by topic
    cnd = df.values
    similarity = cnd.T @ cnd  # take the dot product to get the similarity matrix
    dd = pd.DataFrame(similarity, index=corpus, columns=corpus)  # convert to a dataframe
    return dd

get_similarity_table(things)
If you look at a single row of this table, the columns with the highest values are the most similar sentences. So if you want the single most similar, just take the max; if you want the top 5, sort the row and take the top 5 values (and their corresponding columns).
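The sorting step can be sketched end to end like this (self-contained; note the sentence itself ranks first here because the diagonal is not zeroed):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['i ate a red apple.', 'There are so many colors in the rainbow.',
          'the monster was purple and green.', 'the pickle is very green',
          'the kid read the book the little red riding hood',
          'in the book the wizard of oz there was a yellow brick road.',
          'tom has a green thumb and likes working in a garden.']
things = ['apple', 'pickle', 'tomato', 'rainbow', 'book']

cv = CountVectorizer()
out = cv.fit_transform(corpus).toarray()                 # documents x vocabulary
c = [(k, v) for k, v in cv.vocabulary_.items() if k in things]
cdf = pd.DataFrame(out[:, [i[1] for i in c]], columns=[i[0] for i in c]).T

cnd = cdf.values
dd = pd.DataFrame(cnd.T @ cnd, index=corpus, columns=corpus)  # sentence x sentence

# one row of the table: similarity of doc 4 to every document, most similar first
row = dd.loc['the kid read the book the little red riding hood']
top = row.sort_values(ascending=False)
print(top.head(3))
```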
Here is code for getting the most similar sentence to a given sentence:
def get_similar_review(s, topic):
    df = cdf.loc[cdf.index.isin(topic)]  # filter by topic
    cnd = df.values
    similarity = cnd.T @ cnd  # take the dot product to get the similarity matrix
    np.fill_diagonal(similarity, 0)  # zero the diagonal so a sentence is not returned as its own match
    dd = pd.DataFrame(similarity, index=corpus, columns=corpus)  # convert to a dataframe
    return dd.loc[s].idxmax()  # filter by sentence and get the column name with the max value

s = 'i ate a red apple.'
get_similar_review(s, colors)
# 'the kid read the book the little red riding hood'

s = 'the kid read the book the little red riding hood'
get_similar_review(s, things)
# 'in the book the wizard of oz there was a yellow brick road.'
If you don't want to find similarity by topic, you can skip most of these steps: take the CountVectorizer matrix directly, compute its dot product with its transpose to get a (sentence x sentence) matrix, and use that as the similarity matrix.
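A sketch of that shortcut (since `out` is documents x vocabulary here, the doc-doc matrix is `out @ out.T`; the diagonal is zeroed so a sentence does not match itself):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['i ate a red apple.', 'There are so many colors in the rainbow.',
          'the monster was purple and green.', 'the pickle is very green',
          'the kid read the book the little red riding hood',
          'in the book the wizard of oz there was a yellow brick road.',
          'tom has a green thumb and likes working in a garden.']

cv = CountVectorizer()
out = cv.fit_transform(corpus).toarray()  # documents x vocabulary

sim = out @ out.T                  # (sentence x sentence) similarity matrix
np.fill_diagonal(sim, 0)           # a sentence should not be its own best match
dd = pd.DataFrame(sim, index=corpus, columns=corpus)

print(dd.loc['i ate a red apple.'].idxmax())
```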