
Find the most common multi-word phrases in an input file in Python

Say I have a text file. I can find the most frequent words easily using Counter. However, I would also like to find multi-word phrases like "tax year", "fly fishing", "us capitol", etc., i.e. words that occur together most often.

import re
from collections import Counter

with open('full.txt') as f:
    passage = f.read()

words = re.findall(r'\w+', passage)

cap_words = [word for word in words]

word_counts = Counter(cap_words)

for k, v in word_counts.most_common():
    print(k, v)

This is what I have currently; however, it only finds single words. How do I find multi-word phrases?

What you're looking for is a way to count bigrams (strings containing two words).

The nltk library is great for lots of language-related tasks, and you can use Counter from collections for all your counting-related needs!

import nltk
from nltk import bigrams
from collections import Counter

# word_tokenize may require the 'punkt' tokenizer data: nltk.download('punkt')
tokens = nltk.word_tokenize(passage)
print(Counter(bigrams(tokens)))
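
If you only want the most frequent pairs, Counter.most_common works the same way it does for single words. Here is a minimal, self-contained sketch, assuming the same full.txt input as in the question:

import nltk
from nltk import bigrams
from collections import Counter

# nltk.download('punkt')  # uncomment on first run if the tokenizer data is missing

with open('full.txt') as f:   # same input file as in the question
    passage = f.read()

tokens = nltk.word_tokenize(passage)

# Count adjacent word pairs and print the ten most frequent ones
bigram_counts = Counter(bigrams(tokens))
for (w1, w2), count in bigram_counts.most_common(10):
    print(f"{w1} {w2}: {count}")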

What you call multiwords (there is no such thing) are actually called bigrams. You can get a list of bigrams from a list of words by zipping it with itself shifted by one position:

bigrams = [f"{x} {y}" for x, y in zip(words, words[1:])]
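
To count these pairs you can feed the list straight into Counter, just like in the single-word version. A short sketch building on the words list from the question (the lowercasing step is an optional addition, not part of the original code):

import re
from collections import Counter

with open('full.txt') as f:
    passage = f.read()

# Lowercase so that "Tax year" and "tax year" count as the same phrase (optional)
words = re.findall(r'\w+', passage.lower())

# Pair each word with its successor by zipping the list with itself shifted by one
pairs = [f"{x} {y}" for x, y in zip(words, words[1:])]

for phrase, count in Counter(pairs).most_common(10):
    print(phrase, count)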

PS: NLTK would indeed be a better tool for getting bigrams.
