
How can I preserve hyphenated words when building a text document matrix with the python textmining module?

I have the code below, which compares a piece of text against a stop word set and returns a list of the words in the text that are not in the stop word set. I then change the list of words to a string so that I can use it in the textmining module to create a term document matrix.

I have checks in the code showing that hyphenated words are maintained in the list and in the string, but once I pass them through the TDM part of the code, the hyphenated words are broken up. Is there a way to maintain hyphenated words in the textmining module and the TDM?

import re

f = open("words")  # stop word dictionary, one word per line
stops = set()
for line in f:
    stops.add(line.strip())

f = open("azathoth")  # Azathoth (1922)
azathoth = list()
for line in f:
    # keep letters, hyphens and apostrophes together as one word;
    # note [A-Za-z] here -- the character class [A-z] would also match
    # the punctuation characters between 'Z' and 'a' ([, \, ], ^, _, `)
    azathoth.extend(re.findall("[A-Za-z\-\']+", line.strip()))

# drop any word that appears in the stop word set
azathothcount = list()
for w in azathoth:
    if w in stops:
        continue
    else:
        azathothcount.append(w)

print azathothcount[1:10]
raw_input('Press Enter...')

azathothstr = ' '.join(azathothcount)
print azathothstr
raw_input('Press Enter...')

import textmining

def termdocumentmatrix_example():
    doc1 = azathothstr

    # initialize the class that builds the term document matrix
    tdm = textmining.TermDocumentMatrix()
    tdm.add_doc(doc1)

    # cutoff=1 keeps every term that appears at least once
    tdm.write_csv('matrixhp.csv', cutoff=1)

    for row in tdm.rows(cutoff=1):
        print row

raw_input('Press Enter...')
termdocumentmatrix_example()

The textmining package defaults to its own simple_tokenize function when initializing the TermDocumentMatrix class: add_doc() pushes your text through simple_tokenize() before adding it to the TDM.
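You can see the splitting directly (a minimal check, assuming simple_tokenize behaves as its docstring below describes):

import textmining

# simple_tokenize lowercases the document and strips out everything
# that is not a lowercase letter, so the hyphen becomes a token boundary
print textmining.simple_tokenize('bleep Br-rump!')
# expected: ['bleep', 'br', 'rump']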

help(textmining) yields, in part:

class TermDocumentMatrix(__builtin__.object)
 |  Class to efficiently create a term-document matrix.
 |  
 |  The only initialization parameter is a tokenizer function, which should
 |  take in a single string representing a document and return a list of
 |  strings representing the tokens in the document. If the tokenizer
 |  parameter is omitted it defaults to using textmining.simple_tokenize
 |  
 |  Use the add_doc method to add a document (document is a string). Use the
 |  write_csv method to output the current term-document matrix to a csv
 |  file. You can use the rows method to return the rows of the matrix if
 |  you wish to access the individual elements without writing directly to a
 |  file.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, tokenizer=<function simple_tokenize>)
 |
 |  ...
 |
 |  simple_tokenize(document)
 |  Clean up a document and split into a list of words.
 |
 |  Converts document (a string) to lowercase and strips out 
 |  everything which is not a lowercase letter.

So you'll have to roll your own tokenizer, one that does not split on the hyphen, and pass it in when you initialize the TermDocumentMatrix class.

Ideally this process would preserve the rest of the functionality of simple_tokenize() and only stop it from breaking up hyphenated words, so the natural approach is to route the hyphenated words around that function. Below, I've removed the hyphenated words from the document, pushed the remainder through simple_tokenize(), and then merged the two lists (hyphenated words + simple_tokenize() results) before adding them to the TDM:

import re
import textmining

doc1 = 'blah "blah" blahbitty-blah, in-the bloopity blip bleep br-rump! '

def toknzr(txt):
    # grab the hyphenated words first: runs of \w+ joined by hyphens
    hyph_words = re.findall(r'\w+(?:-\w+)+', txt)
    # strip those words out of the text before handing it to simple_tokenize
    # (guard against an empty alternation when there are no hyphenated words)
    if hyph_words:
        regex = re.compile(r'\b(' + '|'.join(hyph_words) + r')\b',
                           flags=re.IGNORECASE)
        txt = regex.sub("", txt)
    # lowercase the hyphenated words to match simple_tokenize's output,
    # then merge the two lists
    return [w.lower() for w in hyph_words] + textmining.simple_tokenize(txt)

tdm = textmining.TermDocumentMatrix(tokenizer=toknzr)
tdm.add_doc(doc1)
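A quick sanity check on the sample document (assuming simple_tokenize splits the remainder on non-letter characters, as its docstring indicates; the hyphenated words come first because of the merge order above):

print toknzr(doc1)
# expected: ['blahbitty-blah', 'in-the', 'br-rump',
#            'blah', 'blah', 'bloopity', 'blip', 'bleep']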

This may not be the most pythonic way to write your own tokenizer (feedback appreciated!), but the main point here is that you have to initialize the class with a new tokenizer rather than rely on the default simple_tokenize().
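If you don't need to reuse simple_tokenize() at all, an alternative is a single-regex tokenizer that treats internal hyphens as part of a word (a sketch; hyphen_tokenize is a hypothetical name, not part of the textmining package):

import re

def hyphen_tokenize(txt):
    # lowercase, then grab runs of letters optionally joined by hyphens
    return re.findall(r'[a-z]+(?:-[a-z]+)*', txt.lower())

tdm = textmining.TermDocumentMatrix(tokenizer=hyphen_tokenize)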

