
Looking for n-grams in texts

Would it be possible to pick a specific n-gram from the whole list of them and look for it in a list of sentences? For example:

I have the following sentences (from a dataframe column):

example = ['Mary had a little lamb. Jack went up the hill' , 
           'Jack went to the beach' ,    
           'i woke up suddenly' ,
           'it was a really bad dream...']

and the n-grams (bigrams) obtained from

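from sklearn.feature_extraction.text import CountVectorizer
word_v = CountVectorizer(ngram_range=(2,2), analyzer='word')
mat = word_v.fit_transform(df['Example'])
# total count of each bigram across all rows
frequencies = sum(mat).toarray()[0]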

which gives the frequency of each n-gram.

I would like to select

  • the most frequent bi-grams
  • a bi-gram selected manually

within the example list above.

So, let's say that the most frequent bi-gram is "Jack went": how could I look for it in the example above? Also, if I want to look not at the most frequent bi-grams but at hill/beach in the example, how could I do it?

To select the rows that contain the most frequent n-gram, you can do:

df.loc[mat.toarray()[:, frequencies==frequencies.max()].astype(bool)]
                                         Example
0  Mary had a little lamb. Jack went up the hill
1                         Jack went to the beach

Note that if two n-grams share the maximum frequency, you would get all the rows where both are present.
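A minimal sketch of how the tie case could be made explicit, assuming the same example sentences as in the question. Reducing the selected columns to a one-dimensional row mask with `.any(axis=1)` keeps rows containing at least one of the tied n-grams, while `.all(axis=1)` keeps only rows containing every one of them. The names `top_cols`, `rows_any` and `rows_all` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

example = ['Mary had a little lamb. Jack went up the hill',
           'Jack went to the beach',
           'i woke up suddenly',
           'it was a really bad dream...']
df = pd.DataFrame({'Example': example})

word_v = CountVectorizer(ngram_range=(2, 2), analyzer='word')
mat = word_v.fit_transform(df['Example'])
# Total count of each bigram over all rows, as a 1-D array
frequencies = np.asarray(mat.sum(axis=0)).ravel()

# Boolean columns for every bigram tied for the highest frequency
top_cols = mat.toarray()[:, frequencies == frequencies.max()] > 0

rows_any = df.loc[top_cols.any(axis=1)]  # at least one of the tied bigrams
rows_all = df.loc[top_cols.all(axis=1)]  # every one of the tied bigrams
print(rows_any)
```

With this data only one bigram reaches the maximum, so both selections return the same rows; they differ only when several n-grams are tied.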

If you want the 3 most frequent n-grams (top) or the 3 least frequent ones (hill), and all the rows that contain any of them:

import numpy as np
top = 3
print (df.loc[mat.toarray()[:, np.argsort(frequencies)][:, -top:].any(axis=1)])
                                         Example
0  Mary had a little lamb. Jack went up the hill
1                         Jack went to the beach
2                             i woke up suddenly
3                   it was a really bad dream...
# with this example, that is all the rows

hill = 3
print (df.loc[mat.toarray()[:, np.argsort(frequencies)][:, :hill].any(axis=1)])
                        Example
1        Jack went to the beach
3  it was a really bad dream...

Finally, if you want a specific n-gram:

ng = 'it was'
# scikit-learn >= 1.0; older versions use get_feature_names() instead
df.loc[mat.toarray()[:, np.array(word_v.get_feature_names_out())==ng].astype(bool)]
                        Example
3  it was a really bad dream...
