Would it be possible to take a specific n-gram from the whole list of n-grams and look for it in a list of sentences? For example:
I have the following sentences (from a dataframe column):
example = ['Mary had a little lamb. Jack went up the hill' ,
'Jack went to the beach' ,
'i woke up suddenly' ,
'it was a really bad dream...']
and n-grams (bigrams) obtained with:
from sklearn.feature_extraction.text import CountVectorizer
word_v = CountVectorizer(ngram_range=(2,2), analyzer='word')
mat = word_v.fit_transform(df['Example'])
frequencies = sum(mat).toarray()[0]
which gives the frequency of each n-gram.
I would like to select specific n-grams within the list example above.
So, let's say that the most frequent bi-gram is Jack went: how could I look for it in the example above? Also, if I want to look not at the most frequent bi-grams but at hill/beach in the example, how could I do it?
To select the rows that contain the most frequent n-gram, you can do:
df.loc[mat.toarray()[:, frequencies==frequencies.max()].astype(bool).all(axis=1)]
Example
0 Mary had a little lamb. Jack went up the hill
1 Jack went to the beach
However, if two n-grams tie for the maximum frequency, you get only the rows where both are present.
If you want the top/bottom 3 n-grams by frequency and all the rows that contain any of them:
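import numpy as np
top = 3
# argsort orders the columns by ascending frequency, so the last `top` columns are the most frequent n-grams
print(df.loc[mat.toarray()[:, np.argsort(frequencies)][:, -top:].any(axis=1)])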
Example
0 Mary had a little lamb. Jack went up the hill
1 Jack went to the beach
2 i woke up suddenly
3 it was a really bad dream...
# here, that is every row in the example
bottom = 3
# the first `bottom` columns after argsort are the least frequent n-grams
print(df.loc[mat.toarray()[:, np.argsort(frequencies)][:, :bottom].any(axis=1)])
Example
1 Jack went to the beach
3 it was a really bad dream...
Finally, if you want a specific n-gram:
ng = 'it was'
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df.loc[mat.toarray()[:, word_v.get_feature_names_out()==ng].any(axis=1)]
Example
3 it was a really bad dream...
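For looking up arbitrary n-grams such as the hill or the beach, one possible alternative is to go through the fitted vectorizer's vocabulary_ mapping (n-gram to column index) and read only that one sparse column instead of converting the whole matrix. This is a sketch under the assumption that the default CountVectorizer settings are used, which lowercase the text and drop one-character tokens; rows_with_ngram is a hypothetical helper name.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'Example': [
    'Mary had a little lamb. Jack went up the hill',
    'Jack went to the beach',
    'i woke up suddenly',
    'it was a really bad dream...',
]})

word_v = CountVectorizer(ngram_range=(2, 2), analyzer='word')
mat = word_v.fit_transform(df['Example'])


def rows_with_ngram(ngram):
    """Hypothetical helper: rows of df whose text contains the given n-gram."""
    # The vectorizer lowercases by default, so normalise the query the same way.
    col = word_v.vocabulary_.get(ngram.lower())
    if col is None:
        # Unknown n-gram: return an empty frame with the same columns.
        return df.iloc[0:0]
    # Read a single sparse column and turn it into a 1-D boolean row mask.
    mask = mat[:, col].toarray().ravel() > 0
    return df.loc[mask]


print(rows_with_ngram('Jack went'))
print(rows_with_ngram('the hill'))
print(rows_with_ngram('the beach'))
```

An n-gram that never appeared in the fitted text simply yields an empty dataframe rather than raising an error.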