Most common words between 2 files using Python

I am new to Python and am trying to write a script that finds the most frequent words common to 2 files. I can find the most common words in each file separately, but I am not sure how to count, say, the top 5 words that are common to both files. I need to find the words that appear in both files and, among those, the ones with the highest frequency across both files.

import re
from collections import Counter


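# Read the first file, strip punctuation, and lowercase it.
finalLineLower = ''
with open("test3.txt", "r") as hfFile:
    for line in hfFile:
        # The hyphen is escaped so that ")-=" is not read as a character range.
        finalLine = re.sub(r'[,.<;:)\-=!>_(?"]', '', line)
        finalLineLower += finalLine.lower()
# Split once, after the whole file has been read.
words1 = finalLineLower.split()

# Read test2.txt into a list of stripped lines (not used below).
with open('test2.txt', 'r') as f:
    sWords = [line.strip() for line in f]

# Same processing for the second file.
finalLineLower1 = ''
with open("test4.txt", "r") as tsFile:
    for line in tsFile:
        finalLine = re.sub(r'[,.<;:)\-=!>_(?"]', '', line)
        finalLineLower1 += finalLine.lower()
words = finalLineLower1.split()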
#print (words)
mc = Counter(words).most_common()
mc2 = Counter(words1).most_common()

print(len(mc))
print(len(mc2))

Example test3 and test4 files are shown below. test3:

Essays are generally scholarly pieces of writing giving the author's own argument, but the definition is vague, overlapping with those of an article, a pamphlet and a short story.

test4:

Essays are generally scholarly pieces of writing giving the author's own argument, but the definition is vague, overlapping with those of an article, a pamphlet and a short story.

Essays can consist of a number of elements, including: literary criticism, political manifestos, learned arguments, observations of daily life, recollections, and reflections of the author. Almost all modern essays are written in prose, but works in verse have been dubbed essays (e.g. Alexander Pope's An Essay on Criticism and An Essay on Man). While brevity usually defines an essay, voluminous works like John Locke's An Essay Concerning Human Understanding and Thomas Malthus's An Essay on the Principle of Population are counterexamples. In some countries (e.g., the United States and Canada), essays have become a major part of formal education. Secondary students are taught structured essay formats to improve their writing skills, and admission essays are often used by universities in selecting applicants and, in the humanities and social sciences, as a way of assessing the performance of students during final exams.

You can simply find the intersection of your Counter objects with the & operator:

mc = Counter(words)
mc2 = Counter(words1)
total=mc&mc2
mos=total.most_common(N)

Example:

>>> d1={'a':5,'f':2,'c':1,'h':2,'t':4}
>>> d2={'a':3,'b':2,'e':1,'h':5,'t':6}
>>> c1=Counter(d1)
>>> c2=Counter(d2)
>>> t=c1&c2
>>> t
Counter({'t': 4, 'a': 3, 'h': 2})
>>> t.most_common(2)
[('t', 4), ('a', 3)]

Note that & returns the minimum count for each word across your Counters. You can also use the union operator |, which returns the maximum counts, and then use a simple dict comprehension to keep the maximum counts of only the common words:

>>> m=c1|c2
>>> m
Counter({'t': 6, 'a': 5, 'h': 5, 'b': 2, 'f': 2, 'c': 1, 'e': 1})
>>> max_counts={i:j for i,j in m.items() if i in t}
>>> max_counts
{'a': 5, 'h': 5, 't': 6}

Finally, if you want the total count of each common word, you can add your Counters together (the maximum and the minimum of two counts add up to their sum):

>>> s=Counter(max_counts)+t
>>> s
Counter({'t': 10, 'a': 8, 'h': 7})
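Applied to actual text rather than hand-built dicts, the three variants above (minimum, maximum, and sum over the shared words) might look like the following sketch. The `tokenize` and `shared_counts` helpers and the sample strings are illustrative assumptions, not part of the original code.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical helper: lowercase the text and return runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def shared_counts(words_a, words_b):
    """Return (min, max, sum) Counters restricted to words present in both lists."""
    c1, c2 = Counter(words_a), Counter(words_b)
    low = c1 & c2  # minimum count per shared word
    # Union gives the maximum count; keep only words that are in the intersection.
    high = Counter({w: n for w, n in (c1 | c2).items() if w in low})
    total = low + high  # min + max equals the count in a plus the count in b
    return low, high, total


# Made-up sample data for illustration.
a = tokenize("the cat sat on the mat the end")
b = tokenize("the dog and the cat")
low, high, total = shared_counts(a, b)
print(low.most_common(2))    # [('the', 2), ('cat', 1)]
print(total.most_common(2))  # [('the', 5), ('cat', 2)]
```

Which of the three Counters to rank by depends on what "most common in both files" is meant to express.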

This question is ambiguous.

You could be asking for the words that are most common among the two files together, so that, e.g., a word that appears 10000 times in file1 and 1 time in file2 counts as appearing 10001 times. In that case:

mc = Counter(words) + Counter(words1) # or Counter(chain(words, words1))
mos = mc.most_common(5)
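As a quick check of the alternative mentioned in the comment, here is a small sketch with made-up word lists showing that adding the two Counters gives the same totals as counting the chained lists. `chain` comes from `itertools` and needs its own import.

```python
from collections import Counter
from itertools import chain

# Made-up word lists standing in for the contents of the two files.
words = ["essay", "essay", "prose", "verse"]
words1 = ["essay", "prose", "prose", "author"]

added = Counter(words) + Counter(words1)
chained = Counter(chain(words, words1))

print(added == chained)       # True
print(added["essay"], added["prose"])  # 3 3
```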

Or you could be asking for the words that are most common in either file, that also appear at least once in the other file:

mc = Counter(words)
mc1 = Counter(words1)
mcmerged = Counter({word: max(mc[word], mc1[word]) for word in mc if word in mc1})
mos = mcmerged.most_common(5)

Or the most common among the two files together, but only if they also appear at least once in each file:

mc = Counter(words)
mc1 = Counter(words1)
mcmerged = Counter({word: mc[word] + mc1[word] for word in mc if word in mc1})
mos = mcmerged.most_common(5)
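To see that these interpretations really can disagree, here is a small sketch on made-up word lists, chosen so that each rule picks a different top word. The list contents and the names `shared`, `combined`, `max_shared`, and `sum_shared` are assumptions for illustration only.

```python
from collections import Counter

# Made-up word lists: "only1" and "only2" each occur in just one list.
words = ["rare"] * 1 + ["both"] * 3 + ["only1"] * 9
words1 = ["rare"] * 5 + ["both"] * 4 + ["only2"] * 2

mc = Counter(words)
mc1 = Counter(words1)
shared = mc.keys() & mc1.keys()  # words present in both lists

# Rule 1: total count across both lists, shared or not.
combined = mc + mc1
# Rule 2: highest count in either list, shared words only.
max_shared = Counter({w: max(mc[w], mc1[w]) for w in shared})
# Rule 3: total count across both lists, shared words only.
sum_shared = Counter({w: mc[w] + mc1[w] for w in shared})

print(combined.most_common(1))    # [('only1', 9)]
print(max_shared.most_common(1))  # [('rare', 5)]
print(sum_shared.most_common(1))  # [('both', 7)]
```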

There are probably other ways this could be interpreted as well. If you can phrase the rule in unambiguous English, it should be pretty easy to translate it to Python; if you can't do so, it will be impossible.


From your comments, it sounds like you're not actually reading the code in this answer, and are instead trying to use your mc = Counter(words).most_common() in place of the mc = Counter(words) or mc = Counter(words) + Counter(words1), etc. in this answer. When you call most_common() on a Counter, you get back a list, not a Counter. Just… don't do that; use the code that's actually here.
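A minimal sketch of the difference, using made-up word lists: the result of most_common() is a plain list, so the Counter operators no longer apply to it.

```python
from collections import Counter

# Made-up word lists for illustration.
words = ["essay", "essay", "prose"]
words1 = ["essay", "verse"]

as_list = Counter(words).most_common()  # list of (word, count) tuples
as_counter = Counter(words)             # still a Counter

print(type(as_list).__name__)     # list
print(type(as_counter).__name__)  # Counter

# Operators such as & are defined for Counters, not for lists.
try:
    as_list & Counter(words1).most_common()
except TypeError as exc:
    print("TypeError:", exc)

# Keep the Counters, combine them, and call most_common() only at the end.
common = as_counter & Counter(words1)
print(common.most_common(5))      # [('essay', 1)]
```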
