
Find the most common multi-word phrases in an input file in Python

Say I have a text file. I can find the most frequent words easily using Counter. However, I would also like to find multi-word phrases like "tax year", "fly fishing", "us capitol", etc., i.e. words that occur together most often.

import re
from collections import Counter

with open('full.txt') as f:
    passage = f.read()

words = re.findall(r'\w+', passage)

cap_words = [word for word in words]

word_counts = Counter(cap_words)

for k, v in word_counts.most_common():
    print(k, v)

This is what I have currently; however, it only finds single words. How do I find multi-word phrases?

What you're looking for is a way to count bigrams (strings containing two words).

The nltk library is great for lots of language-related tasks, and you can use Counter from collections for all your counting-related needs!

import nltk
from nltk import bigrams
from collections import Counter

# word_tokenize may require the 'punkt' tokenizer data: nltk.download('punkt')
tokens = nltk.word_tokenize(passage)
print(Counter(bigrams(tokens)))
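
If you only want the most frequent pairs, Counter.most_common works the same way it does for single words. Here is a minimal, self-contained sketch, assuming the same full.txt input as in the question:

import nltk
from nltk import bigrams
from collections import Counter

# nltk.download('punkt')  # uncomment on first run if the tokenizer data is missing

with open('full.txt') as f:   # same input file as in the question
    passage = f.read()

tokens = nltk.word_tokenize(passage)

# Count adjacent word pairs and print the ten most frequent ones
bigram_counts = Counter(bigrams(tokens))
for (w1, w2), count in bigram_counts.most_common(10):
    print(f"{w1} {w2}: {count}")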

What you call multiwords (there is no such thing) are actually called bigrams. You can get a list of bigrams from a list of words by zipping it with itself shifted by one position:

bigrams = [f"{x} {y}" for x, y in zip(words, words[1:])]
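
To count these pairs you can feed the list straight into Counter, just like in the single-word version. A short sketch building on the words list from the question (the lowercasing step is an optional addition, not part of the original code):

import re
from collections import Counter

with open('full.txt') as f:
    passage = f.read()

# Lowercase so that "Tax year" and "tax year" count as the same phrase (optional)
words = re.findall(r'\w+', passage.lower())

# Pair each word with its successor by zipping the list with itself shifted by one
pairs = [f"{x} {y}" for x, y in zip(words, words[1:])]

for phrase, count in Counter(pairs).most_common(10):
    print(phrase, count)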

PS: NLTK would indeed be a better tool for getting bigrams.
