
How do I make it so I can read a text file for only specific words?

How do I write code that reads a text file for only specific words and displays each word along with its count (the number of times it appears in the file)?

from collections import Counter
import re

def openfile(filename):
    # read the whole file into a single string
    fh = open(filename, "r")
    text = fh.read()
    fh.close()
    return text

def removegarbage(text):
    # replace every run of non-word characters with a single space, lowercase the rest
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text

def getwordbins(words):
    # tally how many times each word occurs
    cnt = Counter()
    for word in words:
        cnt[word] += 1
    return cnt

def main(filename, topwords):
    txt = openfile(filename)
    txt = removegarbage(txt)
    words = txt.split(' ')
    bins = getwordbins(words)
    for key, value in bins.most_common(topwords):
        print key, value

main('filename.txt', 10)

I think using that many functions is overly complicated; why not do it all in a single function?

# def function if desired
# you may have the filepath/specific words etc as parameters

 f = open("filename.txt")
 counter=0
 for line in f:
     # you can remove punctuation, translate them to spaces,
     # now any interesting words will be surrounded by spaces and
     # you can detect them
     line = line.translate(maketrans(".,!? ","     "))
     words = line.split() # splits on any number of whitespaces
     for word in words:
         if word == specificword:
             # of use a list of specific words: 
             # if word in specificwordlist:
             counter+=1
             print word
             # you could also append the words to some list, 
             # create a dictionary etc
 f.close()

A generator that yields all the words in a file comes in handy:

from collections import Counter
import re

def words(filename):
    regex = re.compile(r'\w+')
    with open(filename) as f:
        for line in f:
            for word in regex.findall(line):
                yield word.lower()

Then, either:

wordcount = Counter(words('filename.txt'))               
for word in ['foo', 'bar']:
    print word, wordcount[word]

or:

words_to_count = set(['foo', 'bar'])
wordcount = Counter(word for word in words('filename.txt') 
                    if word in words_to_count)               
print wordcount.items()

I think what you're looking for is a simple dictionary. That way you can keep track not only of the words you want, but also of their counts.

A dictionary stores things as key/value pairs. So, for example, you could use the key 'alice' (the word you're looking for) and set its value to the number of times that word has been found.

The easiest way to check whether something is in a dictionary is with Python's in keyword:

if 'pie' in words_in_my_dict:
    # do something

With this information, setting up a word counter is very easy!

def get_word_counts(words_to_count, filename):
    # note: despite its name, 'filename' here is the file's text, not a path
    words = filename.split(' ')
    for word in words:
        if word in words_to_count:
            words_to_count[word] += 1
    return words_to_count

if __name__ == '__main__':

    fake_file_contents = (
        "Alice's Adventures in Wonderland (commonly shortened to "
        "Alice in Wonderland) is an 1865 novel written by English"
        " author Charles Lutwidge Dodgson under the pseudonym Lewis"
        " Carroll.[1] It tells of a girl named Alice who falls "
        "down a rabbit hole into a fantasy world populated by peculiar,"
        " anthropomorphic creatures. The tale plays with logic, giving "
        "the story lasting popularity with adults as well as children."
        "[2] It is considered to be one of the best examples of the literary "
        "nonsense genre,[2][3] and its narrative course and structure, "
        "characters and imagery have been enormously influential[3] in "
        "both popular culture and literature, especially in the fantasy genre."
        )

    words_to_count = {
        'alice' : 0,
        'and' : 0,
        'the' : 0
        }

    print get_word_counts(words_to_count, fake_file_contents)

This gives the output:

{'and': 4, 'the': 5, 'alice': 0}

Since the dictionary stores the words we want to count along with their occurrences, the whole algorithm is just: check whether each word is in the dict, and if it is, add 1 to that word's value. (Note that 'alice' comes out as 0 here because the sample text only contains the capitalized "Alice" and "Alice's", and split(' ') keeps case and punctuation intact.)

Read about dictionaries here.

Edit:

If you want to count all the words and then look up a specific set of them, a dictionary is still great (and fast!).

The only change we need to make is to first check whether the key already exists in the dictionary, and add it if it doesn't.

def get_all_word_counts(filename):
    # note: despite its name, 'filename' here is the file's text, not a path
    words = filename.split(' ')

    word_counts = {}
    for word in words: 
        if word not in word_counts:     #If not already there
            word_counts[word] = 0   # add it in.
        word_counts[word] += 1          #Increment the count accordingly
    return word_counts

This gives the output:

and : 4
shortened : 1
named : 1
popularity : 1
peculiar, : 1
be : 1
populated : 1
is : 2
(commonly : 1
nonsense : 1
an : 1
down : 1
fantasy : 2
as : 2
examples : 1
have : 1
in : 4
girl : 1
tells : 1
best : 1
adults : 1
one : 1
literary : 1
story : 1
plays : 1
falls : 1
author : 1
giving : 1
enormously : 1
been : 1
its : 1
The : 1
to : 2
written : 1
under : 1
genre,[2][3] : 1
literature, : 1
into : 1
pseudonym : 1
children.[2] : 1
imagery : 1
who : 1
influential[3] : 1
characters : 1
Alice's : 1
Dodgson : 1
Adventures : 1
Alice : 2
popular : 1
structure, : 1
1865 : 1
rabbit : 1
English : 1
Lutwidge : 1
hole : 1
Carroll.[1] : 1
with : 2
by : 2
especially : 1
a : 3
both : 1
novel : 1
anthropomorphic : 1
creatures. : 1
world : 1
course : 1
considered : 1
Lewis : 1
Charles : 1
well : 1
It : 2
tale : 1
narrative : 1
Wonderland) : 1
culture : 1
of : 3
Wonderland : 1
the : 5
genre. : 1
logic, : 1
lasting : 1

Note: as you can see, there are a few "misses" when we split(' ') the file. Specifically, some words end up with parentheses or other punctuation stuck to their start or end. You'll have to account for that in your file processing, but I'll let you figure that out yourself!
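One possible way to deal with those misses (a minimal sketch, not part of the original answer; the clean() helper and its regex are just illustrative) is to strip everything but letters and apostrophes from each token and lowercase it before counting:

import re

def clean(token):
    # keep only letters and apostrophes, then lowercase
    return re.sub(r"[^A-Za-z']+", "", token).lower()

def get_all_word_counts_cleaned(file_contents):
    word_counts = {}
    for token in file_contents.split():
        word = clean(token)
        if not word:                 # the token was pure punctuation or digits
            continue
        if word not in word_counts:
            word_counts[word] = 0
        word_counts[word] += 1
    return word_counts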

This may be enough... it's not exactly what you asked for, but the end result is what you want (I think):

from collections import Counter

interesting_words = ["ipsum","dolor"]

some_text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec viverra consectetur sapien, sed posuere sem rhoncus quis. Mauris sit amet ligula et nulla ultrices commodo sed sit amet odio. Nullam vel lobortis nunc. Donec semper sem ut est convallis posuere adipiscing eros lobortis. Nullam tempus rutrum nulla vitae pretium. Proin ut neque id nisi semper faucibus. Sed sodales magna faucibus lacus tristique ornare.
"""

d = Counter(some_text.split())
final_list = filter(lambda item:item[0] in interesting_words,d.items())

However, its complexity isn't great, so it may take a while on large files and/or a long list of "interesting_words".
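If that becomes a problem, one common tweak (a sketch, not from the original answer) is to make interesting_words a set, so each membership test is O(1), and to look up only the interesting words instead of filtering every counted item:

from collections import Counter

interesting_words = {"ipsum", "dolor"}      # a set gives O(1) membership tests

word_counts = Counter(some_text.split())    # some_text as defined above

# look up just the words we care about instead of filtering all counted items
final_list = [(word, word_counts[word]) for word in interesting_words]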

