
Bi-grams in Python with lots of txt files

I have a corpus of 70,429 files (296.5 MB). I am trying to find bi-grams over the whole corpus. I have written the following code:

allFiles = ""
for dirName in os.listdir(rootDirectory):
     for subDir in os.listdir(dirName):
         for fileN in os.listdir(subDir):
             FText = codecs.open(fileN, encoding="'iso8859-9'")
             PText = FText.read()
             allFiles += PText
tokens = allFiles.split()
finder = BigramCollocationFinder.from_words(tokens, window_size = 3)
finder.apply_freq_filter(2)
bigram_measures = nltk.collocations.BigramAssocMeasures()
for k,v in finder.ngram_fd.most_common(100):
    print(k,v)

There is a root directory; the root directory contains subdirectories, and each subdirectory contains numerous files. What I have done is:

I read all of the files one by one and append their contents to the string called allFiles. Eventually, I split the string into tokens and call the relevant bi-gram functions. The problem is:

I ran the program for a day and couldn't get any results. Is there a more efficient way to find bigrams within a corpus which includes lots of files?

Any advice and suggestions will be greatly appreciated. Thanks in advance.

By trying to read a huge corpus into memory at once, you're blowing out your memory, forcing a lot of swap use, and slowing everything down.

The NLTK provides various "corpus readers" that can return your words one by one, so that the complete corpus is never stored in memory at the same time. This might work if I understand your corpus layout right:

from nltk.corpus.reader import PlaintextCorpusReader
# fileids is a regular expression over file paths relative to the corpus root
reader = PlaintextCorpusReader(rootDirectory, r".*/.*/.*", encoding="iso8859-9")
finder = BigramCollocationFinder.from_words(reader.words(), window_size=3)
finder.apply_freq_filter(2)  # Continue processing as before
...

Addendum: Your approach has a bug: you're collecting n-grams that span from the end of one document to the beginning of the next, which is nonsense you want to get rid of. I recommend the following variant, which collects n-grams from each document separately.

# One word stream per document, so no n-grams span document boundaries.
document_streams = (reader.words(fname) for fname in reader.fileids())
BigramCollocationFinder.default_ws = 3  # from_documents takes no window_size argument
finder = BigramCollocationFinder.from_documents(document_streams)
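The finder built this way can then be filtered and ranked as in the original snippet. Here is a minimal sketch of that follow-up step, assuming nltk and the finder variable from the code above are in scope; the likelihood-ratio measure is purely illustrative, since bigram_measures is created but never actually used in the question's code:

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder.apply_freq_filter(2)                   # drop pairs seen fewer than 2 times overall
for bigram, count in finder.ngram_fd.most_common(100):
    print(bigram, count)                      # raw co-occurrence counts, as in the question
print(finder.nbest(bigram_measures.likelihood_ratio, 100))  # or rank by an association measure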

Consider parallelizing your process with Python's multiprocessing process pool ( https://docs.python.org/2/library/multiprocessing.html ), having each worker emit a dictionary of {ngram: count} for one file of the corpus into some shared list. After the worker pool completes, merge the dictionaries before filtering by the number of occurrences; a rough sketch follows below.
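Here is a minimal sketch of that parallel approach, assuming Python 3 and plain whitespace tokenization as in the question; the helper name count_bigrams, the placeholder value of rootDirectory, and the final frequency threshold are illustrative only:

import codecs
import os
from collections import Counter
from multiprocessing import Pool
from nltk.collocations import BigramCollocationFinder

def count_bigrams(path):
    # Count bigram co-occurrences (window of 3) in a single file.
    with codecs.open(path, encoding="iso8859-9") as f:
        tokens = f.read().split()
    finder = BigramCollocationFinder.from_words(tokens, window_size=3)
    return Counter(finder.ngram_fd)

if __name__ == "__main__":
    rootDirectory = "corpus_root"        # adjust to your corpus location
    paths = [os.path.join(dirpath, name)
             for dirpath, _, names in os.walk(rootDirectory)
             for name in names]
    with Pool() as pool:                 # one worker process per CPU core by default
        per_file_counts = pool.map(count_bigrams, paths)
    totals = Counter()                   # merge the per-file dictionaries...
    for counts in per_file_counts:
        totals.update(counts)
    frequent = Counter({bg: n for bg, n in totals.items() if n >= 2})  # ...then filter
    for bg, n in frequent.most_common(100):
        print(bg, n)

Note that the frequency filter is applied only after the per-file counts are merged, so it behaves like apply_freq_filter(2) over the whole corpus rather than per file.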
