
Generating Ngrams (Unigrams, Bigrams etc) from a large corpus of .txt files and their Frequency

I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written code to input my files into the program.

The input is 300 .txt files written in English, and I want the output in the form of Ngrams, especially the frequency counts.

I know that NLTK has Bigram and Trigram modules: http://www.nltk.org/_modules/nltk/model/ngram.html

but I am not advanced enough to incorporate them into my program.

input: txt files, NOT single sentences

output example:

Bigram: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]

Trigram: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]

My code up to now is:

from nltk.corpus import PlaintextCorpusReader
corpus = 'C:/Users/jack3/My folder'
files = PlaintextCorpusReader(corpus, '.*')
ngrams = 2

def generate(file, ngrams):
    # So far this only prints the name of the output file for each n-gram order.
    for gram in range(0, ngrams):
        print((file[0:-4] + "_" + str(ngrams) + "_grams.txt").replace("/", "_"))

for file in files.fileids():
    generate(file, ngrams)

Any help on what should be done next?

Just use nltk.ngrams.

import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter

text = "I need to write a program in NLTK that breaks a corpus (a large collection of \
txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams.\ 
I need to write a program in NLTK that breaks a corpus"
token = nltk.word_tokenize(text)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)
fourgrams = ngrams(token,4)
fivegrams = ngrams(token,5)

print(Counter(bigrams))

Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2,
 ('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2,
 ('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2,
 ('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1, ('unigrams', 
','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1, ('trigrams', ','): 1,
 (',', 'bigrams'): 1, ('large', 'collection'): 1, ('bigrams', ','): 1, ('of',
 'txt'): 1, (')', 'into'): 1, ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1,
 ('(', 'a'): 1, (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1, 
('collection', 'of'): 1, ('files', ')'): 1})
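
To tie this back to the question's setup (300 .txt files read with PlaintextCorpusReader), a minimal sketch that accumulates counts for every order from unigrams to fivegrams could look like the following; the corpus path is taken from the question, and the final print is an assumption about how you might inspect the result:

from collections import Counter
from nltk.corpus import PlaintextCorpusReader
from nltk.util import ngrams

corpus_root = 'C:/Users/jack3/My folder'   # path from the question
files = PlaintextCorpusReader(corpus_root, r'.*\.txt')

# One Counter per n-gram order, accumulated over all files.
counts = {n: Counter() for n in range(1, 6)}
for fileid in files.fileids():
    tokens = files.words(fileid)           # tokenized words of one file
    for n in range(1, 6):
        counts[n].update(ngrams(tokens, n))

print(counts[2].most_common(10))           # ten most frequent bigrams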

UPDATE (with pure python):

import os
import nltk
from nltk.util import ngrams
from collections import Counter

corpus = []
path = '.'
# Read every .txt file in the current directory into the corpus list.
for i in next(os.walk(path))[2]:
    if i.endswith('.txt'):
        f = open(os.path.join(path, i))
        corpus.append(f.read())

frequencies = Counter([])
for text in corpus:
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token, 2)
    frequencies += Counter(bigrams)

If efficiency is an issue and you have to build multiple different n-grams, but you want to use pure Python, I would do:

from itertools import chain

def n_grams(seq, n=1):
    """Returns an iterator over the n-grams given a list_tokens"""
    shift_token = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shifted_tokens = (shift_token(i) for i in range(n))
    tuple_ngrams = zip(*shifted_tokens)
    return tuple_ngrams # if join in generator : (" ".join(i) for i in tuple_ngrams)

def range_ngrams(list_tokens, ngram_range=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngram_range) given a list_tokens."""
    return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))

Usage:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngram_range=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

~Same speed as NLTK:

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngram_range=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Repost from my previous answer.

Here is a simple example using pure Python to generate any ngram:

>>> def ngrams(s, n=2, i=0):
...     while len(s[i:i+n]) == n:
...         yield s[i:i+n]
...         i += 1
...
>>> txt = 'Python is one of the awesomest languages'

>>> unigram = ngrams(txt.split(), n=1)
>>> list(unigram)
[['Python'], ['is'], ['one'], ['of'], ['the'], ['awesomest'], ['languages']]

>>> bigram = ngrams(txt.split(), n=2)
>>> list(bigram)
[['Python', 'is'], ['is', 'one'], ['one', 'of'], ['of', 'the'], ['the', 'awesomest'], ['awesomest', 'languages']]

>>> trigram = ngrams(txt.split(), n=3)
>>> list(trigram)
[['Python', 'is', 'one'], ['is', 'one', 'of'], ['one', 'of', 'the'], ['of', 'the', 'awesomest'], ['the', 'awesomest',
'languages']]
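
If you also need frequencies from this generator, a small sketch on top of it: the slices it yields are lists, which are not hashable, so convert each one to a tuple before handing it to Counter (the Counter step is an addition, not part of the original snippet).

>>> from collections import Counter
>>> # lists are unhashable, so turn each n-gram into a tuple before counting
>>> Counter(tuple(gram) for gram in ngrams(txt.split(), n=2))
Counter({('Python', 'is'): 1, ('is', 'one'): 1, ('one', 'of'): 1, ('of', 'the'): 1, ('the', 'awesomest'): 1, ('awesomest', 'languages'): 1})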

Ok, so since you asked for an NLTK solution this might not be exactly what you were looking for, but have you considered TextBlob? It has an NLTK backend but a simpler syntax. It would look something like this:

from textblob import TextBlob

text = "Paste your text or text-containing variable here" 
blob = TextBlob(text)
ngram_var = blob.ngrams(n=3)
print(ngram_var)

Output:
[WordList(['Paste', 'your', 'text']), WordList(['your', 'text', 'or']), WordList(['text', 'or', 'text-containing']), WordList(['or', 'text-containing', 'variable']), WordList(['text-containing', 'variable', 'here'])]

You would of course still need to use Counter or some other method to add a count per ngram.
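
For example, a hedged sketch of that counting step (each n-gram returned by TextBlob is a WordList, i.e. a list, so it needs to be converted to a tuple to be hashable):

from collections import Counter

counts = Counter(tuple(gram) for gram in blob.ngrams(n=3))
print(counts.most_common(5))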

However, the fastest approach (by far) I have been able to find, to both create any ngram you'd like and also count them in a single function, stems from this post from 2012 and uses itertools. It's great.
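
That post is not reproduced here, but a rough itertools-based sketch of the idea (build and count the n-grams in a single function; the function name and structure are my own, not taken from the post) might look like:

from collections import Counter
from itertools import islice

def count_ngrams(tokens, n):
    """Count n-grams in one pass by zipping n shifted views of the tokens."""
    shifted = (islice(tokens, i, None) for i in range(n))   # the i-th view starts at offset i
    return Counter(zip(*shifted))

# count_ngrams('test the ngrams generator'.split(), 2)
# Counter({('test', 'the'): 1, ('the', 'ngrams'): 1, ('ngrams', 'generator'): 1})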

The answer of @hellpander above is correct, but not efficient for a very large corpus (I faced difficulties with ~650K documents). The code slows down considerably every time the frequencies are updated, because looking up keys in the dictionary becomes expensive as its content grows. So you need an additional buffer variable to cache the frequencies Counter from @hellpander's answer. Instead of doing key lookups against the very large frequencies dictionary every time a new document is processed, you add the counts to a temporary, smaller Counter. Then, after some number of iterations, it is added to the global frequencies. This way it is much faster, because the lookup against the huge dictionary happens far less often.

import os
import nltk
from nltk.util import ngrams
from collections import Counter

corpus = []
path = '.'
for i in next(os.walk(path))[2]:
    if i.endswith('.txt'):
        f = open(os.path.join(path, i))
        corpus.append(f.read())

frequencies = Counter([])
f = Counter([])  # small buffer Counter to avoid constant lookups in the huge one
for i in range(0, len(corpus)):
    token = nltk.word_tokenize(corpus[i])
    bigrams = ngrams(token, 2)
    f += Counter(bigrams)
    if (i % 10000 == 0):
        # merge the buffer into the global frequencies counter and clear it every 10000 docs
        frequencies += f
        f = Counter([])
frequencies += f  # merge whatever is left in the buffer after the last batch

Maybe it helps. See the link.

import spacy

nlp_en = spacy.load("en_core_web_sm")
doc = nlp_en("Your text here")   # 'doc' was missing in the original snippet
[x.text for x in doc]
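
The snippet above only tokenizes. To actually get n-gram counts out of the spaCy tokens, one possible follow-up (the sample sentence and the zip-based pairing are assumptions, not part of the original answer) is:

from collections import Counter
import spacy

nlp_en = spacy.load("en_core_web_sm")
doc = nlp_en("Hi How are you ? i am fine and you")    # sample sentence, assumed
tokens = [x.text for x in doc]
bigram_counts = Counter(zip(tokens, tokens[1:]))       # pair each token with its successor
print(bigram_counts.most_common(5))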
