How can I preserve hyphenated words when building a term-document matrix with the Python textmining module?

I have the code below, which compares a piece of text to a stop word set and returns a list of the words in the text that are not in the stop word set. I then join the list into a string so I can use it with the textmining module to create a term-document matrix.

I have checks in the code showing that the hyphenated words are kept in the list and in the string, but once I pass the string through the TDM part of the code, the hyphenated words get broken up. Is there a way to keep hyphenated words intact in the textmining module and the TDM?

import re

f= open ("words")  #dictionary
stops = set()
for line in f:
    stops.add(line.strip())

f = open ("azathoth") #Azathoth (1922)
azathoth = list()
for line in f:
    azathoth.extend(re.findall("[A-z\-\']+", line.strip()))

azathothcount = list()
for w in azathoth:
    if w not in stops:
        azathothcount.append(w)

print azathothcount[1:10]  # spot-check: hyphenated words are still present
raw_input('Press Enter...')

azathothstr = ' '.join(azathothcount)
print azathothstr  # the joined string also still contains the hyphenated words
raw_input('Press Enter...')

import textmining

def termdocumentmatrix_example():
    doc1 = azathothstr

    tdm = textmining.TermDocumentMatrix()
    tdm.add_doc(doc1)

    tdm.write_csv('matrixhp.csv', cutoff=1)

    # print each row of the matrix (cutoff=1 keeps all terms)
    for row in tdm.rows(cutoff=1):
        print row

raw_input('Press Enter...')
termdocumentmatrix_example()

The textmining package defaults to its own 'simple_tokenize' function when initializing the TermDocumentMatrix class. add_doc() pushes your text through simple_tokenize() before adding it to the tdm.
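You can see this with a quick check (the expected output below is what the simple_tokenize docstring implies and what your question reports; I haven't verified it against every version of the package):

import textmining

print textmining.simple_tokenize('blahbitty-blah bloopity')
# expected: ['blahbitty', 'blah', 'bloopity'] -- the hyphen is dropped
# and the hyphenated word comes back as two separate tokens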

help(textmining) yields, in part:

class TermDocumentMatrix(__builtin__.object)
 |  Class to efficiently create a term-document matrix.
 |  
 |  The only initialization parameter is a tokenizer function, which should
 |  take in a single string representing a document and return a list of
 |  strings representing the tokens in the document. If the tokenizer
 |  parameter is omitted it defaults to using textmining.simple_tokenize
 |  
 |  Use the add_doc method to add a document (document is a string). Use the
 |  write_csv method to output the current term-document matrix to a csv
 |  file. You can use the rows method to return the rows of the matrix if
 |  you wish to access the individual elements without writing directly to a
 |  file.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, tokenizer=<function simple_tokenize>)
 |
 |  ...
 |
 |  simple_tokenize(document)
 |      Clean up a document and split into a list of words.
 |
 |      Converts document (a string) to lowercase and strips out
 |      everything which is not a lowercase letter.

So you'll have to roll your own tokenizer, one that does not split on the hyphen, and pass it in when you initialize the TermDocumentMatrix class.

Ideally the new tokenizer would keep the rest of simple_tokenize()'s behaviour and only stop it from breaking up hyphenated words, so one option is to route the hyphenated words around that function. Below, I pull the hyphenated words out of the document, push the remainder through simple_tokenize(), and then merge the two lists (hyphenated words + simple_tokenize() results) before adding the document to the TDM:

doc1 = 'blah "blah" blahbitty-blah, in-the bloopity blip bleep br-rump! '

import re

def toknzr(txt):
    # grab the hyphenated words first (word runs joined by one or more '-')
    hyph_words = re.findall(r'\w+(?:-\w+)+', txt)
    # strip those words out of the text so simple_tokenize() never sees them
    remove = '|'.join(hyph_words)
    regex = re.compile(r'\b(' + remove + r')\b', flags=re.IGNORECASE)
    simple = regex.sub("", txt)
    # lowercase the hyphenated words so they match simple_tokenize()'s output
    return [w.lower() for w in hyph_words] + textmining.simple_tokenize(simple)

tdm = textmining.TermDocumentMatrix(tokenizer=toknzr)
tdm.add_doc(doc1)
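
To check that it worked, print the rows the same way as in your original code (cutoff=1 keeps every term):

for row in tdm.rows(cutoff=1):
    print row
# the header row should now include 'blahbitty-blah', 'in-the' and
# 'br-rump' as single, unsplit terms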

This may not be the most pythonic way to write your own tokenizer (feedback appreciated!), but the main point here is that you have to initialize the class with a new tokenizer rather than relying on the default simple_tokenize().
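
If you'd rather not depend on simple_tokenize() at all, a single regex can do the whole job. hyphen_tokenize below is my own name, not part of the textmining package; it lowercases the text and then keeps runs of letters with optional internal hyphens, which is close to (but not exactly) what simple_tokenize() does:

import re
import textmining

def hyphen_tokenize(txt):
    # lowercase, then grab letter runs that may contain internal hyphens
    return re.findall(r"[a-z]+(?:-[a-z]+)*", txt.lower())

tdm = textmining.TermDocumentMatrix(tokenizer=hyphen_tokenize)
tdm.add_doc('blah "blah" blahbitty-blah, in-the bloopity blip bleep br-rump!')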
