
Most common words between 2 files using Python

I am new to Python and am trying to write a script that finds the most frequent words common to 2 files. I am able to find the most common words in each of the 2 files separately, but I am not sure how to get, say, the top 5 words that are common to both files. I need to find the common words, and those common words should also have the highest frequencies across both files.

import re
from collections import Counter


finalLineLower = ''
with open("test3.txt", "r") as hfFile:
    for line in hfFile:
        # '-' is escaped so that ')-=' is not read as a character range (which would also strip digits)
        finalLine = re.sub(r'[,.<;:)\-=!>_(?"]', '', line)
        finalLineLower += finalLine.lower()
words1 = finalLineLower.split()  # split once, after the whole file has been read

with open('test2.txt', 'r') as f:
    sWords = [line.strip() for line in f]


finalLineLower1 = ''
with open("test4.txt", "r") as tsFile:
    for line in tsFile:
        finalLine = re.sub(r'[,.<;:)\-=!>_(?"]', '', line)
        finalLineLower1 += finalLine.lower()
words = finalLineLower1.split()
#print (words)
mc = Counter(words).most_common()
mc2 = Counter(words1).most_common()

print(len(mc))
print(len(mc2))

Example test3 and test4 files are as below.

test3:

Essays are generally scholarly pieces of writing giving the author's own argument, but the definition is vague, overlapping with those of an article, a pamphlet and a short story.

test4:

Essays are generally scholarly pieces of writing giving the author's own argument, but the definition is vague, overlapping with those of an article, a pamphlet and a short story.

Essays can consist of a number of elements, including: literary criticism, political manifestos, learned arguments, observations of daily life, recollections, and reflections of the author. Almost all modern essays are written in prose, but works in verse have been dubbed essays (e.g. Alexander Pope's An Essay on Criticism and An Essay on Man). While brevity usually defines an essay, voluminous works like John Locke's An Essay Concerning Human Understanding and Thomas Malthus's An Essay on the Principle of Population are counterexamples. In some countries (e.g., the United States and Canada), essays have become a major part of formal education. Secondary students are taught structured essay formats to improve their writing skills, and admission essays are often used by universities in selecting applicants and, in the humanities and social sciences, as a way of assessing the performance of students during final exams.

You can simply find the intersection of your Counter objects with the & operator:

mc = Counter(words)
mc2 = Counter(words1)
total=mc&mc2
mos=total.most_common(N)

Example:

>>> d1={'a':5,'f':2,'c':1,'h':2,'t':4}
>>> d2={'a':3,'b':2,'e':1,'h':5,'t':6}
>>> c1=Counter(d1)
>>> c2=Counter(d2)
>>> t=c1&c2
>>> t
Counter({'t': 4, 'a': 3, 'h': 2})
>>> t.most_common(2)
[('t', 4), ('a', 3)]
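
As a rough sketch of how the intersection could be applied to raw text rather than hand-written dicts, the snippet below tokenizes two strings and keeps the top shared words. The helper names tokenize and top_common_words, the sample strings, and the simple letters-and-apostrophes tokenizer are assumptions made for illustration; they are not part of the original question or answer.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical helper: lowercase the text and keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def top_common_words(text_a, text_b, n=5):
    # Counter & Counter keeps only the shared words, each with the smaller count.
    shared = Counter(tokenize(text_a)) & Counter(tokenize(text_b))
    return shared.most_common(n)


text_a = "the cat and the dog and the bird"
text_b = "the dog chased the cat"
print(top_common_words(text_a, text_b, n=3))
```

Here "the" appears 3 times in the first string and 2 times in the second, so it is reported with a count of 2; words that occur in only one string are dropped.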

But note that & returns the minimum count between your Counters. You can also use the union operator |, which returns the maximum counts, and then use a simple dict comprehension to keep the maximum counts of only the common words:

>>> m=c1|c2
>>> m
Counter({'t': 6, 'a': 5, 'h': 5, 'b': 2, 'f': 2, 'c': 1, 'e': 1})
>>> max_counts={i:j for i,j in m.items() if i in t}
>>> max_counts
{'a': 5, 'h': 5, 't': 6}

And finally, if you want the sum of the counts of the common words, you can add your Counters together (the maximum plus the minimum of two counts equals their total):

>>> s=Counter(max_counts)+t
>>> s
Counter({'t': 10, 'a': 8, 'h': 7})
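
To see the three variants side by side, here is a small sketch that reuses the same example counts and computes the minimum, maximum, and sum for each shared word in one pass. The summary dict and its tuple layout are illustrative choices, not something taken from the answer above.

```python
from collections import Counter

c1 = Counter({'a': 5, 'f': 2, 'c': 1, 'h': 2, 't': 4})
c2 = Counter({'a': 3, 'b': 2, 'e': 1, 'h': 5, 't': 6})

# Key views support set operations, so this yields the words present in both.
common = c1.keys() & c2.keys()

# For each shared word: (smaller count, larger count, combined count).
summary = {w: (min(c1[w], c2[w]), max(c1[w], c2[w]), c1[w] + c2[w]) for w in common}

# Print the shared words ordered by their combined count, highest first.
for word in sorted(summary, key=lambda w: summary[w][2], reverse=True):
    print(word, summary[word])
```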

This question is ambiguous.

You could be asking for the words that are most common among the two files together, so that, e.g., a word that appears 10000 times in file1 and 1 time in file2 counts as appearing 10001 times. In that case:

mc = Counter(words) + Counter(words1) # or Counter(chain(words, words1))
mos = mc.most_common(5)
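
A minimal sketch of the two spellings mentioned in the comment, using made-up word lists (the sample words are assumptions). For ordinary word counts, which are all positive, adding the Counters and counting the chained lists should give the same result.

```python
from collections import Counter
from itertools import chain

words = ["essay", "essay", "prose", "verse"]
words1 = ["essay", "verse", "verse", "author"]

by_addition = Counter(words) + Counter(words1)   # add the per-file counts
by_chaining = Counter(chain(words, words1))      # count one combined stream

print(by_addition == by_chaining)
print(by_addition.most_common(2))
```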

Or you could be asking for the words that are most common in either file and that also appear at least once in the other file:

mc = Counter(words)
mc1 = Counter(words1)
mcmerged = Counter({word: max(mc[word], mc1[word]) for word in mc if word in mc1})
mos = mcmerged.most_common(5)
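
As a quick illustration with invented counts (the words and numbers below are assumptions), a word that is very frequent in one file and rare in the other still ranks by its larger count, while words missing from either file are dropped:

```python
from collections import Counter

mc = Counter({"essay": 10000, "prose": 3, "verse": 1})
mc1 = Counter({"essay": 1, "verse": 7, "author": 50})

# Keep only words present in both counters, scored by the larger of the two counts.
mcmerged = Counter({word: max(mc[word], mc1[word]) for word in mc if word in mc1})
print(mcmerged.most_common(5))  # [('essay', 10000), ('verse', 7)]
```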

Or the most common among the two files together, but only counting words that appear at least once in each file:

mc = Counter(words)
mc1 = Counter(words1)
mcmerged = Counter({word: mc[word] + mc1[word] for word in mc if word in mc1})
mos = mcmerged.most_common(5)

There are probably other ways this could be interpreted as well. If you can phrase the rule in unambiguous English, it should be pretty easy to translate it to Python; if you can't do so, it will be impossible.
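
One way to make the choice of rule explicit is to wrap the three interpretations above in a single function. This is only a sketch: the function name top_words and the rule labels "combined", "max_shared", and "sum_shared" are hypothetical names introduced here, not anything from the question.

```python
from collections import Counter


def top_words(words_a, words_b, rule, n=5):
    a, b = Counter(words_a), Counter(words_b)
    shared = a.keys() & b.keys()  # words that occur in both lists
    if rule == "combined":        # total count across both files
        merged = a + b
    elif rule == "max_shared":    # larger count, shared words only
        merged = Counter({w: max(a[w], b[w]) for w in shared})
    elif rule == "sum_shared":    # total count, shared words only
        merged = Counter({w: a[w] + b[w] for w in shared})
    else:
        raise ValueError(f"unknown rule: {rule!r}")
    return merged.most_common(n)


words_a = ["x", "x", "x", "y"]
words_b = ["x", "z", "z", "z", "z", "z"]
for rule in ("combined", "max_shared", "sum_shared"):
    print(rule, top_words(words_a, words_b, rule, n=1))
```

With these sample lists the rules disagree: "combined" ranks "z" first even though it never appears in the first list, while the two shared-only rules can only ever return "x".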


From your comments, it sounds like you're not actually using the code in this answer, and are instead trying to use your mc = Counter(words).most_common() rather than the mc = Counter(words) or mc = Counter(words) + Counter(words1), etc. from this answer. When you call most_common() on a Counter, you get back a list, not a Counter. Just don't do that; use the code that's actually here.
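
The difference can be checked directly. In this small sketch (the sample word lists are assumptions), the result of most_common() is a plain list, so set-style operators such as & are not available on it, whereas the Counter itself supports them:

```python
from collections import Counter

words = ["essay", "essay", "prose"]
words1 = ["essay", "verse"]

as_list = Counter(words).most_common()   # list of (word, count) tuples
as_counter = Counter(words)              # still a Counter

print(type(as_list).__name__)
print(type(as_counter).__name__)

try:
    as_list & Counter(words1).most_common()
except TypeError as exc:
    # Lists do not implement &, so combining two most_common() results fails.
    print("cannot intersect lists:", exc)

# Combine the Counters first, then call most_common() as the last step.
print((as_counter & Counter(words1)).most_common(5))
```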

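Notice: The technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.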

 