Efficiently count word frequencies in python
I want to count the frequency of all words in a text file.
>>> countInFile('test.txt')
should return {'aaa': 1, 'bbb': 2, 'ccc': 1}
if the target text file looks like this:
# test.txt
aaa bbb ccc
bbb
I implemented it in pure python, following some posts. However, I found that the pure-python approach is insufficient due to the file size (> 1GB).
I think borrowing the power of sklearn is a candidate.
If you let CountVectorizer count the frequencies for each line, I guess you would get word frequencies by summing up each column. But that sounds a bit indirect.
What is the most efficient and straightforward way to count words in a file with python?
My (very slow) code is here:
import string
from collections import Counter

def get_term_frequency_in_file(source_file_path):
    wordcount = {}
    with open(source_file_path) as f:
        for line in f:
            line = line.lower().translate(None, string.punctuation)
            this_wordcount = Counter(line.split())
            wordcount = add_merge_two_dict(wordcount, this_wordcount)
    return wordcount

def add_merge_two_dict(x, y):
    return {k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y)}
The most succinct approach is to use the tools Python gives you.
from future_builtins import map  # Only on Python 2
from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        return Counter(chain.from_iterable(map(str.split, f)))
That's it. map(str.split, f) is making a generator that returns a list of the words from each line. Wrapping it in chain.from_iterable converts that to a single generator that produces one word at a time. Counter takes an input iterable and counts all unique values in it. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts, and during creation you only ever store a line of data at a time plus the running totals, never the whole file at once.
In theory, on Python 2.7 and 3.1, you might do slightly better by looping over the chained results yourself and using a dict or collections.defaultdict(int) to count (because Counter is implemented in Python, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (I mean, the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher, Counter has a C-level accelerator for counting iterable inputs that runs faster than anything you could write in pure Python.
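For completeness, a minimal Python 3 sketch of that defaultdict alternative (the function name is illustrative; this only tends to pay off on interpreters where Counter lacks the C accelerator):

from collections import defaultdict
from itertools import chain

def count_in_file_defaultdict(filename):
    counts = defaultdict(int)
    with open(filename) as f:
        # count by hand instead of delegating to Counter
        for word in chain.from_iterable(map(str.split, f)):
            counts[word] += 1
    return counts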
Update: You seem to want punctuation stripped and case-insensitivity, so here is a variant of my earlier code that does that:
from string import punctuation

def countInFile(filename):
    with open(filename) as f:
        linewords = (line.translate(None, punctuation).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))
Your code runs much more slowly because it creates and destroys many small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would at least be algorithmically similar in scaling factor).
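For illustration, a minimal sketch of that "one Counter, updated once per line" pattern (punctuation and case handling omitted for brevity; the function name is illustrative):

from collections import Counter

def count_in_file_update(filename):
    counts = Counter()
    with open(filename) as f:
        for line in f:
            # a single in-place update per line instead of building and
            # merging a brand-new Counter on every line
            counts.update(line.split())
    return counts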
A memory-efficient and accurate way is to make use of CountVectorizer in scikit (for ngram extraction), NLTK's word_tokenize, a numpy matrix sum to collect the counts, and collections.Counter for collecting the counts and the vocabulary. An example:
import urllib.request
from collections import Counter
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
# Our sample textfile.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X matrix where the row represents sentences and column is our one-hot vector for each token in our vocabulary
X = ngram_vectorizer.fit_transform(data.split('\n'))
# Vocabulary
vocab = list(ngram_vectorizer.get_feature_names())
# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))
[out]:
[(',', 32000),
('.', 17783),
('de', 11225),
('a', 7197),
('que', 5710),
('la', 4732),
('je', 4304),
('se', 4013),
('на', 3978),
('na', 3834)]
Essentially, you can also do this:
from collections import Counter
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names())
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))
Let's timeit:
import time
start = time.time()
word_distribution = freq_dist(data)
print (time.time() - start)
[out]:
5.257147789001465
Note that CountVectorizer can also take a file instead of a string, and there is no need to read the whole file into memory. In code:
import io
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
infile = '/path/to/input.txt'
ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)
with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)
vocab = ngram_vectorizer.get_feature_names()
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))
This should be sufficient.
def countinfile(filename):
    d = {}
    with open(filename, "r") as fin:
        for line in fin:
            words = line.strip().split()
            for word in words:
                try:
                    d[word] += 1
                except KeyError:
                    d[word] = 1
    return d
Here are some benchmarks. It will look strange, but the crudest code wins.
[code]:
from collections import Counter, defaultdict
import io, time
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/file'

def extract_dictionary_sklearn(file_path):
    with io.open(file_path, 'r', encoding='utf8') as fin:
        ngram_vectorizer = CountVectorizer(analyzer='word')
        X = ngram_vectorizer.fit_transform(fin)
        vocab = ngram_vectorizer.get_feature_names()
        counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

def extract_dictionary_native(file_path):
    dictionary = Counter()
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            dictionary.update(line.split())
    return dictionary

def extract_dictionary_paddle(file_path):
    dictionary = defaultdict(int)
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            for word in line.split():
                dictionary[word] += 1
    return dictionary

start = time.time()
extract_dictionary_sklearn(infile)
print time.time() - start

start = time.time()
extract_dictionary_native(infile)
print time.time() - start

start = time.time()
extract_dictionary_paddle(infile)
print time.time() - start
[out]:
38.306814909
24.8241138458
12.1182529926
Data size used in the benchmarks above (154MB):
$ wc -c /path/to/file
161680851
$ wc -l /path/to/file
2176141
A few things to note:
With the sklearn version, there is the overhead of vectorizer creation plus the numpy manipulation and conversion into a Counter object.
With the native Counter update version, it seems like Counter.update() is an expensive operation.

Rather than decoding the whole bytes read from the url, I process the binary data. Because bytes.translate expects its second argument to be a byte string, I utf-8 encode punctuation. After removing punctuation, I utf-8 decode the byte string.
The function freq_dist expects an iterable, which is why I pass data.splitlines().
from urllib2 import urlopen
from collections import Counter
from string import punctuation
from time import time
import sys
from pprint import pprint
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
data = urlopen(url).read()
def freq_dist(data):
    """
    :param data: file-like object opened in binary mode or
                 sequence of byte strings separated by '\n'
    :type data: an iterable sequence
    """
    # For readability
    # return Counter(word for line in data
    #     for word in line.translate(
    #         None, bytes(punctuation.encode('utf-8'))).decode('utf-8').split())

    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    return Counter(words)
start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(word_dist.most_common(10))
Output:
elapsed: 0.806480884552
[(u'de', 11106),
(u'a', 6742),
(u'que', 5701),
(u'la', 4319),
(u'je', 4260),
(u'se', 3938),
(u'\u043d\u0430', 3929),
(u'na', 3623),
(u'da', 3534),
(u'i', 3487)]
It seems that a dict is more efficient than a Counter object.
def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    d = {}
    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    for word in words:
        d[word] = d.get(word, 0) + 1
    return d
start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(sorted(word_dist.items(), key=lambda x: (x[1], x[0]), reverse=True)[:10])
Output:
elapsed: 0.642680168152
[(u'de', 11106),
(u'a', 6742),
(u'que', 5701),
(u'la', 4319),
(u'je', 4260),
(u'se', 3938),
(u'\u043d\u0430', 3929),
(u'na', 3623),
(u'da', 3534),
(u'i', 3487)]
To be more memory-efficient with a huge file, you can pass just the opened url instead. But then the timing will include the file download time as well.
data = urlopen(url)
word_dist = freq_dist(data)
Skip CountVectorizer and scikit-learn.
The file may be too large to load into memory, but I doubt the python dictionary gets too big. The easiest option for you may be to split the large file into 10-20 smaller files and extend your code to loop over the smaller files.
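For illustration, a minimal sketch of that split-then-loop idea, assuming the chunk files were produced beforehand (for example with the Unix split command; file names and chunk count are illustrative):

from collections import Counter
import glob

# e.g. first:  split -n l/20 big.txt chunk_   (GNU split, keeps lines intact)
total = Counter()
for path in sorted(glob.glob('chunk_*')):
    with open(path) as f:
        for line in f:
            total.update(line.split())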
You can try it with sklearn:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
data = ['i am student', 'the student suffers a lot']
transformed_data = vectorizer.fit_transform(data)
vocab = {a: b for a, b in zip(vectorizer.get_feature_names(), np.ravel(transformed_data.sum(axis=0)))}
print (vocab)
Combining everyone else's views and some of my own :) Here is what I have for you:
from collections import Counter
from nltk import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
text='''Note that if you use RegexpTokenizer option, you lose
natural language features special to word_tokenize
like splitting apart contractions. You can naively
split on the regex \w+ without any need for the NLTK.
'''
# tokenize
raw = ' '.join(word_tokenize(text.lower()))
tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)
# remove stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]
# count word frequency, sort and return just 20
counter = Counter()
counter.update(words)
most_common = counter.most_common(20)
most_common
Output:
[('note', 1), ('use', 1), ('regexptokenizer', 1), ('option', 1), ('lose', 1), ('natural', 1), ('language', 1), ('features', 1), ('special', 1), ('word', 1), ('tokenize', 1), ('like', 1), ('splitting', 1), ('apart', 1), ('contractions', 1), ('naively', 1), ('split', 1), ('regex', 1), ('without', 1), ('need', 1)]
You can do better than this in terms of efficiency, but if you are not too worried about that, this code is the best.