
How can I count specific bigrams?

I want to find and count specific bigrams, such as "red apple", in a text file. I have already converted the text file into a word list, so I can't use a regex to count the whole phrase (i.e. the bigram). Or can I?

How can I count a specific bigram in the text file without using NLTK or another module? Could regex be a solution?

Why did you convert the text file into a list of words? That is not memory efficient either. Instead of the text string used here, you can search the result of file.read() directly.

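import re

text = 'I like red apples and green apples but I like red apples more.'
bigrams = ['red apples', 'green apples']

for bigram in bigrams:
    # re.escape keeps any special characters in the phrase literal;
    # \b stops the phrase from matching inside longer words
    pattern = r'\b' + re.escape(bigram) + r'\b'
    print('Found', bigram, len(re.findall(pattern, text)))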

Output:

Found red apples 2
Found green apples 1
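If the text has already been split into a word list, as the question describes, a regex is not strictly needed. The following is a minimal sketch using only the standard library; the helper name count_bigrams is made up for illustration, and it assumes the list preserves the original word order.

```python
from collections import Counter

def count_bigrams(words, targets):
    """Count how often each target bigram occurs in an ordered word list."""
    # Pair each word with its successor: (w0, w1), (w1, w2), ...
    pairs = Counter(zip(words, words[1:]))
    # A Counter returns 0 for pairs that never occur
    return {t: pairs[tuple(t.split())] for t in targets}

words = 'I like red apples and green apples but I like red apples more'.split()
counts = count_bigrams(words, ['red apples', 'green apples'])
print(counts)  # {'red apples': 2, 'green apples': 1}
```

Building the Counter once means every bigram in the text is counted in a single pass, so looking up additional phrases afterwards costs nothing extra.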

Are you looking only for specific bigrams, or might you need to extend the search to detect any bigrams that are common in your text? In the latter case, have a look at the NLTK collocations module. You say you want to do this without NLTK or any other module, but in practice that is a very bad idea. You will miss matches whenever the text contains, for example, 'red apple' rather than 'red apples'. NLTK, on the other hand, provides useful tools for lemmatization, for computing many kinds of statistics, and more.
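To illustrate the singular/plural problem without NLTK, here is a rough sketch that makes the trailing "s" optional and ignores case. It is a crude stand-in for real lemmatization and only covers this one phrase; the sample text is invented for the example.

```python
import re

text = 'A red apple a day. Red apples are sweet, and I like red apples.'

# \b anchors the match at word boundaries; "s?" makes the plural optional;
# \s+ tolerates any run of whitespace between the two words
pattern = re.compile(r'\bred\s+apples?\b', re.IGNORECASE)
matches = pattern.findall(text)
print(len(matches))  # 3
```

Irregular plurals and other inflections would each need their own pattern, which is the kind of work a lemmatizer handles for you.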

Also consider why and how you turned the lines into a list of words. Not only is this inefficient, but depending on exactly how you did it, you may have lost information about word order, handled punctuation incorrectly, mixed up uppercase and lowercase, or made any number of other mistakes. Which, again, is why NLTK is what you need.
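If you do build the word list yourself, a sketch along these lines avoids the pitfalls just mentioned: it keeps word order, lowercases everything, and drops punctuation. The tokenize helper and its character class are assumptions chosen for illustration, not a complete tokenizer.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words in their original order."""
    # Keep runs of letters, digits and apostrophes; punctuation is dropped
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize('I like Red apples. Apples, red or green, are good!')
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams.count(('red', 'apples')))  # 1
```

Note that this treats the text as one stream, so a pair that straddles a sentence boundary (here "apples. Apples") still counts as a bigram. Split into sentences first if that matters for your data.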
