
Find and sort the documents in a corpus most similar to a list of specific words

How do I count and score multiple lists of words against a corpus of documents, so that I can sort the results in a few different ways?

  1. Sort the documents in the corpus by how strongly they match a word from a list, e.g. sorted by most "red":
'i ate a red apple.'
'the kid read the book the little red riding hood'
  2. Find the documents closest to a given document, e.g. most similar to doc 0:
'i ate a red apple.'
'the kid read the book the little red riding hood'

For example:

colors  = ['red', 'blue', 'yellow' , 'purple']
things = ['apple', 'pickle', 'tomato' , 'rainbow', 'book']

corpus = [
    'i ate a red apple.',                                          # 0
    'There are so many colors in the rainbow.',                    # 1
    'the monster was purple and green.',                           # 2
    'the pickle is very green',                                    # 3
    'the kid read the book the little red riding hood',            # 4
    'in the book the wizard of oz there was a yellow brick road.', # 5
    'tom has a green thumb and likes working in a garden.',        # 6
]


Do I make a counter per document?

# 0 'i ate a red apple.'
{'red': 1, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 1, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}

# 1 'There are so many colors in the rainbow.'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 1, 'book': 0}

# 2 'the monster was purple and green.'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 1}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}

# 3 'the pickle is very green'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 1, 'tomato': 0, 'rainbow': 0, 'book': 0}

# 4 'the kid read the book the little red riding hood'
{'red': 1, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 1}

# 5 'in the book the wizard of oz there was a yellow brick road.'
{'red': 0, 'blue': 0, 'yellow': 1, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 1}

# 6 'tom has a green thumb and likes working in a garden.'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}

Or should I build one array for colors and one for things?

# colors
         0    1    2    3    4    5    6
red      1    0    0    0    1    0    0
blue     0    0    0    0    0    0    0
yellow   0    0    0    0    0    1    0
purple   0    0    1    0    0    0    0
# things
          0    1    2    3    4    5    6
apple     1    0    0    0    0    0    0
pickle    0    0    0    1    0    0    0
tomato    0    0    0    0    0    0    0
rainbow   0    1    0    0    0    0    0
book      0    0    0    0    1    1    0

Then find the most similar documents, or sort by the closest counts.

sort by most "red":
'i ate a red apple.'
'the kid read the book the little red riding hood'
most similar to doc 0:
'i ate a red apple.'
'the kid read the book the little red riding hood'

Or should I use doc2vec or something completely different?

You can achieve this by iterating over each line and grouping by word to get the counts:

import pandas as pd

def words_counter(corpus_parameter, colors_par, things_par):
    """ Returns two dataframes with the occurrences of the words in colors_par & things_par
    corpus_parameter: list of strings, common language
    colors_par: list of words with no spaces or punctuation
    things_par: list of words with no spaces or punctuation
    """
    colors_count, things_count = [], [] # lists to collect intermediate series
    for i, line in enumerate(corpus_parameter):
        words = pd.Series(
            line
            .strip(' !?.') # remove any spaces or punctuation from the left/right of the string
            .lower() # count 'red', 'Red', and 'RED' as the same word
            .split() # split on spaces (' ') by default; you can provide a different character
        ) # returns a clean series with all the words
        # print(words) # uncomment to see the series
        words = words.groupby(words).size() # returns the words as index and the count as values
        # print(words) # uncomment to see the series
        colors_count.append(words.loc[words.index.isin(colors_par)])
        things_count.append(words.loc[words.index.isin(things_par)])

    colors_count = (
        pd.concat(colors_count, axis=1) # convert list of series to dataframe
        .reindex(colors_par) # include colors with zero occurrences
        .fillna(0) # get rid of NaNs
        .astype(int) # convert from default float to integer
    )
    things_count = pd.concat(things_count, axis=1).reindex(things_par).fillna(0).astype(int)

    print(colors_count)
    print(things_count)
    return colors_count, things_count

Call it with:

words_counter(corpus, colors, things)

Output

        0  1  2  3  4  5  6
red     1  0  0  0  1  0  0
blue    0  0  0  0  0  0  0
yellow  0  0  0  0  0  1  0
purple  0  0  1  0  0  0  0

         0  1  2  3  4  5  6
apple    1  0  0  0  0  0  0
pickle   0  0  0  1  0  0  0
tomato   0  0  0  0  0  0  0
rainbow  0  1  0  0  0  0  0
book     0  0  0  0  1  1  0
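With those count frames, the "sort by most red" query from the question follows directly: select a word's row and sort the document columns by it. A minimal self-contained sketch (it rebuilds a small `colors_count` inline with the same strip/lower/split cleaning as above, rather than calling `words_counter`):

```python
import pandas as pd

corpus = ['i ate a red apple.', 'There are so many colors in the rainbow.',
          'the monster was purple and green.', 'the pickle is very green',
          'the kid read the book the little red riding hood',
          'in the book the wizard of oz there was a yellow brick road.',
          'tom has a green thumb and likes working in a garden.']
colors = ['red', 'blue', 'yellow', 'purple']

# minimal rebuild of the colors_count frame: words as rows, doc ids as columns
counts = {w: [line.strip(' !?.').lower().split().count(w) for line in corpus]
          for w in colors}
colors_count = pd.DataFrame(counts).T

# sort documents by their 'red' count and keep only those that mention it
most_red = colors_count.loc['red'].sort_values(ascending=False)
for doc_id, n in most_red[most_red > 0].items():
    print(doc_id, corpus[doc_id])
```

This prints docs 0 and 4, the two sentences containing "red".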

IIUC, you have a bunch of topics such as colors, things, moods etc., each with some keywords, and you want to find the similarity between sentences based on the occurrences of keywords from one topic at a time.

You can do this in a few steps -

  1. Fit a count vectorizer to get word occurrences for all unique words
  2. Filter it for only the keywords present in the topic
  3. Take the dot product of the word occurrences for that topic, (sentence * topic) dot (topic * sentence), to get a (sentence * sentence) matrix, which is the non-normalized cosine similarity between each pair of sentences for that topic
  4. Go to a specific row and get the sentence with the highest similarity score in that row (other than the sentence itself)

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
out = cv.fit_transform(corpus).toarray() # apply CountVectorizer

# For scalability (you can have a lot more topics, like moods etc.), I am
# combining all topics first and will filter by the given topic later
combined = colors + things # combine all your topics

c = [(k, v) for k, v in cv.vocabulary_.items() if k in combined] # get vocabulary indexes for the items from all topics

cdf = pd.DataFrame(out[:, [i[1] for i in c]], columns=[i[0] for i in c]).T # filter the count matrix down to those items

print(cdf)
# This results in a keyword occurrence table with all keywords from all topics
         0  1  2  3  4  5  6
red      1  0  0  0  1  0  0
apple    1  0  0  0  0  0  0
rainbow  0  1  0  0  0  0  0
purple   0  0  1  0  0  0  0
pickle   0  0  0  1  0  0  0
book     0  0  0  0  1  1  0
yellow   0  0  0  0  0  1  0

Now, for the next step, filter this by topic (colors, things, etc.) and take the dot product of that matrix (the non-normalized cosine similarity). This can be done with this function -

def get_similarity_table(topic):
    df = cdf.loc[cdf.index.isin(topic)] # filter by topic
    cnd = df.values
    similarity = cnd.T @ cnd # take the dot product to get the similarity matrix
    dd = pd.DataFrame(similarity, index=corpus, columns=corpus) # convert to a dataframe
    return dd

get_similarity_table(things)

(The output is a 7x7 dataframe of pairwise similarity scores, indexed and labelled by the corpus sentences.)

If you look at a single row of this table, the columns with the highest values are the most similar sentences. So if you want the single most similar, just take the max; if you want the top 5, sort the row and take the top 5 values (and their corresponding columns).
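As an illustration of that row lookup, here is a self-contained sketch of a top-k query (it rebuilds `cdf` for the `things` topic inline; the top-3 count and the query sentence are arbitrary choices for the example):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['i ate a red apple.', 'There are so many colors in the rainbow.',
          'the monster was purple and green.', 'the pickle is very green',
          'the kid read the book the little red riding hood',
          'in the book the wizard of oz there was a yellow brick road.',
          'tom has a green thumb and likes working in a garden.']
things = ['apple', 'pickle', 'tomato', 'rainbow', 'book']

# rebuild the keyword-occurrence matrix restricted to the 'things' topic
cv = CountVectorizer()
out = cv.fit_transform(corpus).toarray()
c = [(k, v) for k, v in cv.vocabulary_.items() if k in things]
cdf = pd.DataFrame(out[:, [i[1] for i in c]], columns=[i[0] for i in c]).T

# (sentence * sentence) similarity matrix for this topic
cnd = cdf.values
sim = pd.DataFrame(cnd.T @ cnd, index=corpus, columns=corpus)

# top 3 most similar sentences to a query sentence, excluding itself
s = 'the kid read the book the little red riding hood'
top3 = sim.loc[s].drop(s).sort_values(ascending=False).head(3)
print(top3)
```

The highest-scoring match is the wizard-of-oz sentence, since both share the keyword 'book'.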

Here is the code for getting the most similar sentence to a given sentence:

import numpy as np

def get_similar(s, topic):
    df = cdf.loc[cdf.index.isin(topic)] # filter by topic
    cnd = df.values
    similarity = cnd.T @ cnd # take the dot product to get the similarity matrix
    np.fill_diagonal(similarity, 0) # zero the diagonal so a sentence is not returned as its own match
    dd = pd.DataFrame(similarity, index=corpus, columns=corpus) # convert to a dataframe
    return dd.loc[s].idxmax() # filter by sentence and get the column name with the max value

s = 'i ate a red apple.'
get_similar(s, colors)

# 'the kid read the book the little red riding hood'

s = 'the kid read the book the little red riding hood'
get_similar(s, things)

#'in the book the wizard of oz there was a yellow brick road.'

If you don't want to find similarity by topic, you can skip most of these steps: take the CountVectorizer matrix directly, take its dot product to get a (sentence * sentence) matrix, and that is your similarity matrix over the full vocabulary.
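A rough sketch of that topic-free variant, using sklearn's `cosine_similarity` helper for the normalized version (assuming the same corpus as above):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['i ate a red apple.', 'There are so many colors in the rainbow.',
          'the monster was purple and green.', 'the pickle is very green',
          'the kid read the book the little red riding hood',
          'in the book the wizard of oz there was a yellow brick road.',
          'tom has a green thumb and likes working in a garden.']

# counts over the full vocabulary, no topic filter
X = CountVectorizer().fit_transform(corpus)

# normalized (cosine) similarity between every pair of sentences
sim = pd.DataFrame(cosine_similarity(X), index=corpus, columns=corpus)

# most similar document to doc 0, excluding doc 0 itself
best = sim.iloc[0].drop(sim.index[0]).idxmax()
print(best)
```

Note that CountVectorizer's default tokenizer drops one-letter tokens like 'i' and 'a', so doc 0 only shares 'red' with doc 4, which is why that sentence comes out as the best match.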
