How can hyphenated words be preserved when building a term-document matrix with the Python textmining module?

Question

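In the code below, I compare a piece of text against a set of stop words and return a list of the words in the text that are not in the stop-word set. I then convert the word list to a string so that I can use it with the textmining module to create a term-document matrix.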

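I have checks in the code showing that hyphenated words are preserved in both the list and the string, but once I pass them to the TDM part of the code, the hyphenated words get broken up. Is there a way to keep hyphenated words intact in the textmining module and the TDM?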

import re

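# Load the stop-word list, one word per line.
with open("words") as f:  # dictionary
    stops = set(line.strip() for line in f)

# Collect words, keeping internal hyphens and apostrophes.
# Note: [A-Za-z] rather than [A-z], which would also match [ \ ] ^ _ and `.
azathoth = list()
with open("azathoth") as f:  # Azathoth (1922)
    for line in f:
        azathoth.extend(re.findall(r"[A-Za-z'-]+", line.strip()))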

azathothcount = list()
for w in azathoth:
    if w in stops:
        continue
    else:
        azathothcount.append(w)

print azathothcount[1:10]
raw_input('Press Enter...')

azathothstr = ' '.join(azathothcount)
print azathothstr
raw_input('Press Enter...')

import textmining

def termdocumentmatrix_example():
    doc1 = azathothstr

    tdm = textmining.TermDocumentMatrix()
    tdm.add_doc(doc1)

    tdm.write_csv('matrixhp.csv', cutoff=1)

    for row in tdm.rows(cutoff=1):
        print row

raw_input('Press Enter...')
termdocumentmatrix_example()

Answer 1

When the TermDocumentMatrix class is initialized, the textmining package defaults to its own 'simple_tokenize' function. add_doc() pushes your text through simple_tokenize() before adding it to the tdm.

help(textmining) yields, in part:

class TermDocumentMatrix(__builtin__.object)
 |  Class to efficiently create a term-document matrix.
 |  
 |  The only initialization parameter is a tokenizer function, which should
 |  take in a single string representing a document and return a list of
 |  strings representing the tokens in the document. If the tokenizer
 |  parameter is omitted it defaults to using textmining.simple_tokenize
 |  
 |  Use the add_doc method to add a document (document is a string). Use the
 |  write_csv method to output the current term-document matrix to a csv
 |  file. You can use the rows method to return the rows of the matrix if
 |  you wish to access the individual elements without writing directly to a
 |  file.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, tokenizer=<function simple_tokenize>)
 |
 |  ...
 |
 |simple_tokenize(document)
 |  Clean up a document and split into a list of words.
 |
 |  Converts document (a string) to lowercase and strips out 
 |  everything which is not a lowercase letter.
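
The docstring above accounts for the symptom: a hyphen is not a lowercase letter, so it is discarded and each half of the word becomes its own token. Here is a minimal sketch of that documented behavior. `simple_tokenize_sketch` is a hypothetical stand-in written from the docstring, not the package's actual source, and it assumes that stripped characters are replaced by whitespace (which matches the splitting described in the question).

```python
import re

def simple_tokenize_sketch(document):
    # Hypothetical stand-in for textmining.simple_tokenize, based only on its
    # docstring: lowercase the text, then turn every character that is not a
    # lowercase letter into a space (assumption) and split on whitespace.
    document = document.lower()
    document = re.sub(r'[^a-z]', ' ', document)
    return document.split()

tokens = simple_tokenize_sketch('blahbitty-blah, in-the br-rump!')
# Each hyphenated word is split at the hyphen into separate tokens.
```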

So you will have to roll your own tokenizer that does not split on hyphens, and pass it in when you initialize the TermDocumentMatrix class.

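In my opinion, it is best if this process keeps the rest of what simple_tokenize() does, minus the removal of hyphenated words, so you can route the hyphenated words around that function. Below, I remove the hyphenated words from the document, push the remainder through simple_tokenize(), and then merge the two lists (hyphenated words + simple_tokenize() results) before adding them to the tdm: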

import re
import textmining

doc1 = 'blah "blah" blahbitty-blah, in-the bloopity blip bleep br-rump! '

# Two or more runs of word characters joined by single hyphens.
HYPHENATED = re.compile(r'\w+(?:-\w+)+')

def toknzr(txt):
    # Pull out the hyphenated words first, lowercased to match simple_tokenize().
    hyph_words = [w.lower() for w in HYPHENATED.findall(txt)]
    # Blank them out so simple_tokenize() never sees (and splits) them.
    simple = HYPHENATED.sub(' ', txt)
    return hyph_words + textmining.simple_tokenize(simple)

tdm = textmining.TermDocumentMatrix(tokenizer=toknzr)
tdm.add_doc(doc1)

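This may not be the most Pythonic way to write your own tokenizer (feedback appreciated!), but the main point is that you have to initialize the class with a new tokenizer rather than relying on the default simple_tokenize().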
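
One possible alternative, offered as a sketch rather than as the package's recommended approach, is a single regular expression that lowercases the text and keeps runs of letters joined by internal hyphens. The name `hyphen_tokenize` is made up for illustration, and the sketch assumes tokens should otherwise look like what the simple_tokenize() docstring describes (lowercase letters only).

```python
import re

# One or more lowercase letters, optionally followed by "-letters" groups,
# so "br-rump" stays whole while stray or doubled hyphens still act as breaks.
TOKEN = re.compile(r'[a-z]+(?:-[a-z]+)*')

def hyphen_tokenize(document):
    # Hypothetical tokenizer: lowercase first, then extract the tokens in order.
    return TOKEN.findall(document.lower())

tokens = hyphen_tokenize('blah "blah" blahbitty-blah, in-the bloopity blip bleep br-rump! ')
```

It would be passed in the same way as the tokenizer above, i.e. textmining.TermDocumentMatrix(tokenizer=hyphen_tokenize). Unlike the two-list approach, the hyphenated words stay in their original position, although that ordering should not affect the counts in the matrix.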

Solution 1
0 2015-09-18 17:42:46
