
Most common words between 2 files using Python

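I am new to Python and am trying to write a script that finds the most frequent words common to 2 files. I can find the most common words in each file individually, but I am not sure how to get, say, the top 5 words that appear in both files. I need the words the two files have in common, ranked so that the words with the highest frequency across both files come first.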

import re
from collections import Counter

# Punctuation to strip. The hyphen is escaped so that ")-=" is not read
# as a character range (which would also strip digits, "*", "+" and "/").
PUNCTUATION = r'[,.<;:)\-=!>_(?"]'

finalLineLower = ''
with open("test3.txt", "r") as hfFile:
    for line in hfFile:
        finalLine = re.sub(PUNCTUATION, '', line)
        finalLineLower += finalLine.lower()
# Split once, after the whole file has been read
words1 = finalLineLower.split()

with open('test2.txt', 'r') as f:
    sWords = [line.strip() for line in f]

finalLineLower1 = ''
with open("test4.txt", "r") as tsFile:
    for line in tsFile:
        finalLine = re.sub(PUNCTUATION, '', line)
        finalLineLower1 += finalLine.lower()
words = finalLineLower1.split()

mc = Counter(words).most_common()
mc2 = Counter(words1).most_common()

print(len(mc))
print(len(mc2))

Sample test3 and test4 files are shown below. TEST3:

Essays are generally scholarly pieces of writing giving the author's own argument, but the definition is vague, overlapping with those of an article, a pamphlet and a short story.

TEST4:

Essays are generally scholarly pieces of writing giving the author's own argument, but the definition is vague, overlapping with those of an article, a pamphlet and a short story.

Essays can consist of a number of elements, including: literary criticism, political manifestos, learned arguments, observations of daily life, recollections, and reflections of the author. Almost all modern essays are written in prose, but works in verse have been dubbed essays (e.g. Alexander Pope's An Essay on Criticism and An Essay on Man). While brevity usually defines an essay, voluminous works like John Locke's An Essay Concerning Human Understanding and Thomas Malthus's An Essay on the Principle of Population are counterexamples. In some countries (e.g., the United States and Canada), essays have become a major part of formal education. Secondary students are taught structured essay formats to improve their writing skills, and admission essays are often used by universities in selecting applicants and, in the humanities and social sciences, as a way of assessing the performance of students during final exams.

You can find the intersection of your Counter objects with the & operator:

mc = Counter(words)
mc2 = Counter(words1)
total=mc&mc2
mos=total.most_common(N)

Example:

>>> d1={'a':5,'f':2,'c':1,'h':2,'t':4}
>>> d2={'a':3,'b':2,'e':1,'h':5,'t':6}
>>> c1=Counter(d1)
>>> c2=Counter(d2)
>>> t=c1&c2
>>> t
Counter({'t': 4, 'a': 3, 'h': 2})
>>> t.most_common(2)
[('t', 4), ('a', 3)]

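But note that & returns the minimum count of each word across your counters. You can also use the union operator |, which returns the maximum count, and then a simple dict comprehension to keep the maximum counts of only the common words:

>>> m=c1|c2
>>> m
Counter({'t': 6, 'a': 5, 'h': 5, 'b': 2, 'f': 2, 'c': 1, 'e': 1})
>>> max_counts={i:j for i,j in m.items() if i in t}
>>> max_counts
{'a': 5, 'h': 5, 't': 6}

Finally, if you want the sum of the counts of the common words, you can add your counters together. Since t holds the minimum counts, adding it to the maximum counts gives each common word's total across both counters:

>>> s=Counter(max_counts)+t
>>> s
Counter({'t': 10, 'a': 8, 'h': 7})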
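Applied to actual text, the intersection approach could look roughly like the sketch below. It assumes a simple tokenizer that lowercases the text and keeps runs of letters and apostrophes; the `tokenize` and `top_common_words` helpers and the two sample strings are made up for illustration and stand in for the contents of test3.txt and test4.txt.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def top_common_words(text_a, text_b, n=5):
    """Return the n most frequent words present in both texts.

    Each shared word is scored by the smaller of its two counts,
    which is what Counter's & operator computes.
    """
    shared = Counter(tokenize(text_a)) & Counter(tokenize(text_b))
    return shared.most_common(n)


# Hypothetical sample texts standing in for the two files.
text_a = "the cat and the dog and the dog"
text_b = "the dog chased the dog and the cat"
print(top_common_words(text_a, text_b, n=2))  # [('the', 3), ('dog', 2)]
```

Scoring by the minimum count favors words that are frequent in both texts; a word that is very frequent in only one of them ranks low.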

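The question is ambiguous.

You could be asking for the most common words across the two files combined. For example, a word that appears 10000 times in file1 and once in file2 counts as appearing 10001 times. In that case: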

mc = Counter(words) + Counter(words1) # or Counter(chain(words, words1))
mos = mc.most_common(5)
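The `chain` alternative mentioned in the comment needs an import from `itertools`. A small sketch, using made-up word lists in place of `words` and `words1`, suggesting that both spellings give the same totals:

```python
from collections import Counter
from itertools import chain

# Hypothetical word lists standing in for the contents of the two files.
words = ["essay", "the", "the", "author"]
words1 = ["the", "essay", "essay", "essay"]

summed = Counter(words) + Counter(words1)
chained = Counter(chain(words, words1))

print(summed == chained)       # True
print(summed.most_common(2))   # [('essay', 4), ('the', 3)]
```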

Or you could be asking for the words that are most common in either file and also appear at least once in the other file:

mc = Counter(words)
mc1 = Counter(words1)
mcmerged = Counter({word: max(mc[word], mc1[word]) for word in mc if word in mc1})
mos = mcmerged.most_common(5)

Or the most common words across both files combined, but only counting words that appear at least once in each file:

mc = Counter(words)
mc1 = Counter(words1)
mcmerged = Counter({word: mc[word] + mc1[word] for word in mc if word in mc1})

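There are probably other ways to interpret this. If you can state the rule in unambiguous English, it should be pretty easy to translate it into Python; if you can't, it will be impossible.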
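To see that these interpretations really can disagree, here is a small sketch that runs all three on the same data. The word lists are invented for illustration, and `shared`, `total_any`, `max_shared` and `sum_shared` are hypothetical names, not part of the answer above.

```python
from collections import Counter

# Hypothetical word lists standing in for the two files.
words = ["a"] * 5 + ["b"] * 3 + ["c"]
words1 = ["a"] + ["b"] * 4 + ["d"] * 9

mc, mc1 = Counter(words), Counter(words1)
shared = mc.keys() & mc1.keys()  # words present in both files

# 1. Total count across both files, whether or not the word is shared.
total_any = mc + mc1
# 2. Larger of the two counts, shared words only.
max_shared = Counter({w: max(mc[w], mc1[w]) for w in shared})
# 3. Sum of the two counts, shared words only.
sum_shared = Counter({w: mc[w] + mc1[w] for w in shared})

print(total_any.most_common(1))   # [('d', 9)]
print(max_shared.most_common(1))  # [('a', 5)]
print(sum_shared.most_common(1))  # [('b', 7)]
```

Each rule picks a different top word here, which is why the rule has to be pinned down before writing the code.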


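From your comments, it sounds like you are not actually using the code in this answer, and are instead trying to use your mc = Counter(words).most_common() in place of the mc = Counter(words), mc = Counter(words) + Counter(words1), etc. from the answers. When you call most_common() on a Counter, you get back a list, not a Counter. Just... don't do that; use the actual code.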
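A quick way to check the difference, sketched with invented sample data: the Counter can be combined with other Counters, while the result of most_common() is a plain list of (word, count) tuples.

```python
from collections import Counter

words = ["the", "the", "essay"]  # hypothetical sample data

counts = Counter(words)
ranked = counts.most_common()

print(type(counts).__name__)  # Counter
print(type(ranked).__name__)  # list
print(ranked)                 # [('the', 2), ('essay', 1)]

# Combine the Counter objects first, then rank the combined result.
combined = counts + Counter(["essay", "essay", "essay"])
print(combined.most_common(1))  # [('essay', 4)]
```

So the ranking step belongs at the very end, after all the Counter arithmetic is done.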

