Efficiently count word frequencies in python
I want to count the frequency of all words in a text file.
>>> countInFile('test.txt')
should return {'aaa': 1, 'bbb': 2, 'ccc': 1}
if the target text file looks like this:
# test.txt
aaa bbb ccc
bbb
I implemented it in pure python, following some posts. However, I found that the pure-python approach is insufficient due to the file size (> 1GB).
I think borrowing the power of sklearn is a candidate.
If you let CountVectorizer count the frequencies for each line, I guess you would get word frequencies by summing up each column. But that sounds a bit indirect.
What is the most efficient and straightforward way to count words in a file with python?
My (very slow) code is here:
import string
from collections import Counter

def get_term_frequency_in_file(source_file_path):
    wordcount = {}
    with open(source_file_path) as f:
        for line in f:
            line = line.lower().translate(None, string.punctuation)
            this_wordcount = Counter(line.split())
            wordcount = add_merge_two_dict(wordcount, this_wordcount)
    return wordcount

def add_merge_two_dict(x, y):
    return {k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y)}
The most succinct approach is to use the tools Python gives you.
from future_builtins import map  # Only on Python 2
from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        return Counter(chain.from_iterable(map(str.split, f)))
That's it. map(str.split, f) is making a generator that returns a list of the words from each line. Wrapping it in chain.from_iterable converts that to a single generator that produces one word at a time. Counter takes an input iterable and counts all unique values in it. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts, and during creation you only ever store a line of data at a time plus the running totals, never the whole file at once.
In theory, on Python 2.7 and 3.1, you might do slightly better by looping over the chained results yourself and using a dict or collections.defaultdict(int) to count (because Counter is implemented in Python, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (I mean, the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher, Counter has a C-level accelerator for counting iterable inputs that runs faster than anything you could write in pure Python.
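For completeness, a minimal Python 3 sketch of that defaultdict alternative (the function name is illustrative; this only tends to pay off on interpreters where Counter lacks the C accelerator):

from collections import defaultdict
from itertools import chain

def count_in_file_defaultdict(filename):
    counts = defaultdict(int)
    with open(filename) as f:
        # count by hand instead of delegating to Counter
        for word in chain.from_iterable(map(str.split, f)):
            counts[word] += 1
    return counts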
Update: You seem to want punctuation stripped and case-insensitivity, so here is a variant of my earlier code that does that:
from string import punctuation

def countInFile(filename):
    with open(filename) as f:
        linewords = (line.translate(None, punctuation).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))
Your code runs much more slowly because it creates and destroys many small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would at least be algorithmically similar in scaling factor).
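For illustration, a minimal sketch of that "one Counter, updated once per line" pattern (punctuation and case handling omitted for brevity; the function name is illustrative):

from collections import Counter

def count_in_file_update(filename):
    counts = Counter()
    with open(filename) as f:
        for line in f:
            # a single in-place update per line instead of building and
            # merging a brand-new Counter on every line
            counts.update(line.split())
    return counts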
A memory-efficient and accurate way is to make use of CountVectorizer in scikit (for ngram extraction), NLTK's word_tokenize, a numpy matrix sum to collect the counts, and collections.Counter for collecting the counts and the vocabulary. An example:
import urllib.request
from collections import Counter
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
# Our sample textfile.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X matrix where the row represents sentences and column is our one-hot vector for each token in our vocabulary
X = ngram_vectorizer.fit_transform(data.split('\n'))
# Vocabulary
vocab = list(ngram_vectorizer.get_feature_names())
# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))
[out]:
[(',', 32000),
('.', 17783),
('de', 11225),
('a', 7197),
('que', 5710),
('la', 4732),
('je', 4304),
('se', 4013),
('на', 3978),
('na', 3834)]
Essentially, you can also do this:
from collections import Counter
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names())
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))
Let's timeit:
import time
start = time.time()
word_distribution = freq_dist(data)
print (time.time() - start)
[out]:
5.257147789001465
Note that CountVectorizer can also take a file instead of a string, and there is no need to read the whole file into memory. In code:
import io
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
infile = '/path/to/input.txt'
ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)
with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)
vocab = ngram_vectorizer.get_feature_names()
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))
This should be sufficient.
def countinfile(filename):
    d = {}
    with open(filename, "r") as fin:
        for line in fin:
            words = line.strip().split()
            for word in words:
                try:
                    d[word] += 1
                except KeyError:
                    d[word] = 1
    return d
Here are some benchmarks. It will look strange, but the crudest code wins.
[code]:
from collections import Counter, defaultdict
import io, time
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/file'

def extract_dictionary_sklearn(file_path):
    with io.open(file_path, 'r', encoding='utf8') as fin:
        ngram_vectorizer = CountVectorizer(analyzer='word')
        X = ngram_vectorizer.fit_transform(fin)
        vocab = ngram_vectorizer.get_feature_names()
        counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

def extract_dictionary_native(file_path):
    dictionary = Counter()
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            dictionary.update(line.split())
    return dictionary

def extract_dictionary_paddle(file_path):
    dictionary = defaultdict(int)
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            for word in line.split():
                dictionary[word] += 1
    return dictionary

start = time.time()
extract_dictionary_sklearn(infile)
print time.time() - start

start = time.time()
extract_dictionary_native(infile)
print time.time() - start

start = time.time()
extract_dictionary_paddle(infile)
print time.time() - start
[out]:
38.306814909
24.8241138458
12.1182529926
Data size used in the benchmarks above (154MB):
$ wc -c /path/to/file
161680851
$ wc -l /path/to/file
2176141
A few things to note:
With the sklearn version, there is the overhead of vectorizer creation plus the numpy manipulation and conversion into a Counter object.
With the native Counter update version, it seems like Counter.update() is an expensive operation.

Rather than decoding the whole bytes read from the url, I process the binary data. Because bytes.translate expects its second argument to be a byte string, I utf-8 encode punctuation. After removing punctuation, I utf-8 decode the byte string.
The function freq_dist expects an iterable, which is why I pass data.splitlines().
from urllib2 import urlopen
from collections import Counter
from string import punctuation
from time import time
import sys
from pprint import pprint
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
data = urlopen(url).read()
def freq_dist(data):
    """
    :param data: file-like object opened in binary mode or
                 sequence of byte strings separated by '\n'
    :type data: an iterable sequence
    """
    # For readability
    # return Counter(word for line in data
    #     for word in line.translate(
    #         None, bytes(punctuation.encode('utf-8'))).decode('utf-8').split())

    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    return Counter(words)
start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(word_dist.most_common(10))
Output:
elapsed: 0.806480884552
[(u'de', 11106),
(u'a', 6742),
(u'que', 5701),
(u'la', 4319),
(u'je', 4260),
(u'se', 3938),
(u'\u043d\u0430', 3929),
(u'na', 3623),
(u'da', 3534),
(u'i', 3487)]
It seems that a dict is more efficient than a Counter object.
def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    d = {}
    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    for word in words:
        d[word] = d.get(word, 0) + 1
    return d
start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(sorted(word_dist.items(), key=lambda x: (x[1], x[0]), reverse=True)[:10])
Output:
elapsed: 0.642680168152
[(u'de', 11106),
(u'a', 6742),
(u'que', 5701),
(u'la', 4319),
(u'je', 4260),
(u'se', 3938),
(u'\u043d\u0430', 3929),
(u'na', 3623),
(u'da', 3534),
(u'i', 3487)]
To be more memory-efficient with a huge file, you can pass just the opened url instead. But then the timing will include the file download time as well.
data = urlopen(url)
word_dist = freq_dist(data)
Skip CountVectorizer and scikit-learn.
The file may be too large to load into memory, but I doubt the python dictionary gets too big. The easiest option for you may be to split the large file into 10-20 smaller files and extend your code to loop over the smaller files.
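For illustration, a minimal sketch of that split-then-loop idea, assuming the chunk files were produced beforehand (for example with the Unix split command; file names and chunk count are illustrative):

from collections import Counter
import glob

# e.g. first:  split -n l/20 big.txt chunk_   (GNU split, keeps lines intact)
total = Counter()
for path in sorted(glob.glob('chunk_*')):
    with open(path) as f:
        for line in f:
            total.update(line.split())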
You can try it with sklearn:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
data = ['i am student', 'the student suffers a lot']
transformed_data = vectorizer.fit_transform(data)
vocab = {a: b for a, b in zip(vectorizer.get_feature_names(), np.ravel(transformed_data.sum(axis=0)))}
print (vocab)
Combining everyone else's views and some of my own :) Here is what I have for you:
from collections import Counter
from nltk import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
text='''Note that if you use RegexpTokenizer option, you lose
natural language features special to word_tokenize
like splitting apart contractions. You can naively
split on the regex \w+ without any need for the NLTK.
'''
# tokenize
raw = ' '.join(word_tokenize(text.lower()))
tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)
# remove stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]
# count word frequency, sort and return just 20
counter = Counter()
counter.update(words)
most_common = counter.most_common(20)
most_common
Output:
[('note', 1), ('use', 1), ('regexptokenizer', 1), ('option', 1), ('lose', 1), ('natural', 1), ('language', 1), ('features', 1), ('special', 1), ('word', 1), ('tokenize', 1), ('like', 1), ('splitting', 1), ('apart', 1), ('contractions', 1), ('naively', 1), ('split', 1), ('regex', 1), ('without', 1), ('need', 1)]
You can do better than this in terms of efficiency, but if you are not too worried about that, this code is the best.