
Count only the words in a text file Python

I have to count all the words in a file and build a histogram of the words. I am using the following Python code.

for word in re.split('[,. ]',f2.read()):
    if word not in histogram:
        histogram[word] = 1
    else:
        histogram[word]+=1

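f2 is the file I am reading. I tried to parse it by splitting on multiple delimiters, but it still does not work properly. It counts every string in the file and builds a histogram, but I only want words. I get results like this: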

1-1-3:  3

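where "1-1-3" is a string that occurs 3 times. How do I check so that only actual words are counted? Casing does not matter. I also need to repeat this for two-word sequences, so that the output looks like: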

and the: 4

where "and the" is a two-word sequence that appears 4 times. How do I group two-word sequences together for counting?

from collections import Counter
from nltk.tokenize import RegexpTokenizer
from nltk import bigrams
from string import punctuation

# preparatory stuff
>>> tokenizer = RegexpTokenizer(r'[^\W\d]+')
>>> my_string = "this is my input string. 12345 1-2-3-4-5. this is my input"

# single words
>>> tokens = tokenizer.tokenize(my_string)
>>> Counter(tokens)
Counter({'this': 2, 'input': 2, 'is': 2, 'my': 2, 'string': 1})

# word pairs
>>> nltk_bigrams = bigrams(my_string.split())
>>> bigrams_list = [' '.join(x).strip(punctuation) for x in list(nltk_bigrams)]
>>> Counter([x for x in bigrams_list if x.replace(' ','').isalpha()])
Counter({'is my': 2, 'this is': 2, 'my input': 2, 'input string': 1})
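
The same two counts can be sketched with the standard library alone, for cases where installing nltk is not an option. This is only a minimal sketch: the helper names `count_words` and `count_word_pairs` are made up for illustration, and it assumes that a "word" is a run of letters and that a pair counts only when both neighbouring tokens are purely alphabetic once edge punctuation is stripped.

```python
import re
from collections import Counter
from string import punctuation

# Runs of letters only: no digits, no underscore, no hyphens (assumed definition of a "word")
WORD_RE = re.compile(r"[^\W\d_]+")

def count_words(text):
    """Count letter-only words, ignoring case (hypothetical helper)."""
    return Counter(WORD_RE.findall(text.lower()))

def count_word_pairs(text):
    """Count adjacent token pairs in which both tokens are purely alphabetic."""
    # Strip punctuation from the edges of each whitespace-separated token
    tokens = [t.strip(punctuation) for t in text.lower().split()]
    pairs = zip(tokens, tokens[1:])
    return Counter(" ".join(p) for p in pairs if p[0].isalpha() and p[1].isalpha())

my_string = "this is my input string. 12345 1-2-3-4-5. this is my input"
print(count_words(my_string))
print(count_word_pairs(my_string))
```

For a file, the text of `f2.read()` can be passed in place of `my_string`. One difference from the nltk version above: because punctuation is stripped per token here, a pair such as "string. this" that spans a sentence boundary would be counted as "string this", whereas the nltk version drops it.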

Assuming you want to count all the words in a string, you can do something like the following, using a defaultdict as a counter:

#!/usr/bin/env python3
# coding: utf-8

from collections import defaultdict

# For the sake of simplicity we are using a string instead of reading a file
sentence = "The quick brown fox jumps over the lazy dog. THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG. The quick brown fox"

# Specify the word pairs you want to count as a single phrase
special_pairs = [('the', 'quick')]

# Convert sentence / input to lowercase in order to neglect case sensitivity and print lowercase sentence to double-check
sentence = sentence.lower()
print(sentence)


# Split string into single words
word_list = sentence.split(' ')
print(word_list)

# Merge each special pair into a single phrase so it is counted once instead of
# as two separate words. A new list is built rather than removing items from
# word_list while iterating over it, which could remove the wrong occurrence or
# raise IndexError when the last word equals pair[0].
for pair in special_pairs:
    merged = []
    index = 0
    while index < len(word_list):
        if tuple(word_list[index:index + 2]) == pair:
            merged.append(' '.join(pair))
            index += 2
        else:
            merged.append(word_list[index])
            index += 1
    word_list = merged


d = defaultdict(int)
for word in word_list:
    d[word] += 1

print(d.items())

Output:

the quick brown fox jumps over the lazy dog. the quick brown fox jumps over the lazy dog. the quick brown fox
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.', 'the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.', 'the', 'quick', 'brown', 'fox']
dict_items([('lazy', 2), ('dog.', 2), ('fox', 3), ('brown', 3), ('jumps', 2), ('the quick', 3), ('the', 2), ('over', 2)])
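
Note that the output above still counts 'dog.' with its trailing period, and `dict_items` is not in the `word: count` form shown in the question. A minimal sketch of one way to handle both, assuming punctuation at the edges of a token should be dropped and non-alphabetic tokens ignored; `format_histogram` is a hypothetical helper, not part of the answer above:

```python
from collections import Counter
from string import punctuation

def format_histogram(counts):
    """Return 'key: count' lines, most frequent first (hypothetical helper)."""
    return [f"{key}: {n}" for key, n in counts.most_common()]

sentence = "The quick brown fox jumps over the lazy dog. The quick brown fox"

# Lowercase, split on whitespace, then strip punctuation from each token's edges
words = [w.strip(punctuation) for w in sentence.lower().split()]

# Keep only purely alphabetic tokens, so strings such as "1-1-3" are not counted
counts = Counter(w for w in words if w.isalpha())

for line in format_histogram(counts):
    print(line)
```

`Counter.most_common()` sorts by count in descending order, so the most frequent word is printed first.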

