How can I preserve hyphenated words when building a text document matrix with the Python textmining module?
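In the code below, I compare a piece of text against a set of stop words and return a list of the words in the text that are not in the stop-word set. I then turn the word list into a string so that I can use it with the textmining module to create a term-document matrix (TDM).
I have checks in the code showing that hyphenated words are preserved in both the list and the string, but once I pass them to the TDM part of the code, the hyphenated words get broken apart. Is there a way to keep hyphenated words intact in the textmining module and the TDM?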
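# Python 2 syntax (print statements, raw_input)
import re
import textmining

# Load the stop words, one per line.
stops = set()
with open("words") as f:  # dictionary
    for line in f:
        stops.add(line.strip())

# Collect the words of the text, keeping hyphens and apostrophes.
azathoth = list()
with open("azathoth") as f:  # Azathoth (1922)
    for line in f:
        # [A-Za-z] rather than [A-z]: the latter also matches [ \ ] ^ _ and `
        azathoth.extend(re.findall(r"[A-Za-z'-]+", line.strip()))

# Keep only the words that are not stop words.
azathothcount = list()
for w in azathoth:
    if w not in stops:
        azathothcount.append(w)

print azathothcount[1:10]  # check: hyphens survive in the list
raw_input('Press Enter...')

azathothstr = ' '.join(azathothcount)
print azathothstr  # check: hyphens survive in the string
raw_input('Press Enter...')

def termdocumentmatrix_example():
    doc1 = azathothstr
    tdm = textmining.TermDocumentMatrix()
    tdm.add_doc(doc1)  # hyphenated words are split from here on
    tdm.write_csv('matrixhp.csv', cutoff=1)
    for row in tdm.rows(cutoff=1):
        print row
    raw_input('Press Enter...')

termdocumentmatrix_example()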
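When the TermDocumentMatrix class is initialized, the textmining package defaults to its own simple_tokenize function. add_doc() pushes your text through simple_tokenize() before adding it to the tdm.
help(textmining) yields, in part: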
class TermDocumentMatrix(__builtin__.object)
| Class to efficiently create a term-document matrix.
|
| The only initialization parameter is a tokenizer function, which should
| take in a single string representing a document and return a list of
| strings representing the tokens in the document. If the tokenizer
| parameter is omitted it defaults to using textmining.simple_tokenize
|
| Use the add_doc method to add a document (document is a string). Use the
| write_csv method to output the current term-document matrix to a csv
| file. You can use the rows method to return the rows of the matrix if
| you wish to access the individual elements without writing directly to a
| file.
|
| Methods defined here:
|
| __init__(self, tokenizer=<function simple_tokenize>)
|
| ...
|
|simple_tokenize(document)
| Clean up a document and split into a list of words.
|
| Converts document (a string) to lowercase and strips out
| everything which is not a lowercase letter.
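To see why that docstring implies hyphenated words get split, here is a minimal sketch of a tokenizer that behaves as described. The name approx_simple_tokenize is hypothetical, and the function is written from the docstring alone, not copied from the library. In particular, replacing each stripped character with a space (rather than deleting it) is an assumption, chosen because it matches the splitting seen in the question.

```python
import re

def approx_simple_tokenize(document):
    """Hypothetical stand-in written from the docstring, not the library source."""
    document = document.lower()
    # Assumption: every character that is not a-z becomes a space,
    # so a hyphen acts as a word separator.
    document = re.sub(r'[^a-z]', ' ', document)
    return document.split()

tokens = approx_simple_tokenize("The well-known half-seen city")
print(tokens)  # ['the', 'well', 'known', 'half', 'seen', 'city']
```

Under that assumption, "well-known" reaches the matrix as two separate terms, "well" and "known".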
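So you have to roll your own tokenizer that does not split on hyphens, and pass it in when you initialize the TermDocumentMatrix class.
It would be best, in my opinion, if this process kept the rest of what simple_tokenize() does, minus the splitting of hyphenated words, so that the hyphenated words are routed around that function. Below, I pull the hyphenated words out of the document, push the remainder through simple_tokenize(), and then merge the two lists (hyphenated words + simple_tokenize() results) before they are added to the tdm: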
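import re
import textmining

doc1 = 'blah "blah" blahbitty-blah, in-the bloopity blip bleep br-rump! '

def toknzr(txt):
    # Find hyphenated words such as "in-the" or "br-rump".
    hyph_words = re.findall(r'\w+(?:-\w+)+', txt)
    # Build an alternation of those words and blank them out of the text.
    remove = '|'.join(hyph_words)
    regex = re.compile(r'\b(' + remove + r')\b', flags=re.IGNORECASE)
    simple = regex.sub("", txt)
    # Hyphenated words first, then the default tokenizer's output for the rest.
    return hyph_words + textmining.simple_tokenize(simple)

tdm = textmining.TermDocumentMatrix(tokenizer=toknzr)
tdm.add_doc(doc1)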
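This might not be the most Pythonic way to write your own tokenizer (feedback appreciated!), but the main point is that you have to initialize the class with a new tokenizer rather than rely on the default simple_tokenize().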
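As one possible simplification, the same idea can be sketched with a single regular expression that does not call simple_tokenize() at all. The names TOKEN_RE and hyphen_tokenize are hypothetical. The sketch assumes that lowercasing and keeping only runs of a-z, optionally joined by single hyphens, is close enough to the default behaviour for your data. Note that it drops apostrophes and digits.

```python
import re

# Runs of lowercase letters, optionally joined by single hyphens.
TOKEN_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")

def hyphen_tokenize(document):
    """Lowercase the document and return its words, keeping hyphenated words whole."""
    return TOKEN_RE.findall(document.lower())

doc1 = 'blah "blah" blahbitty-blah, in-the bloopity blip bleep br-rump! '
tokens = hyphen_tokenize(doc1)
print(tokens)
# ['blah', 'blah', 'blahbitty-blah', 'in-the', 'bloopity', 'blip', 'bleep', 'br-rump']
```

It would be passed in the same way, as textmining.TermDocumentMatrix(tokenizer=hyphen_tokenize). Unlike toknzr above, this version keeps the tokens in document order and lowercases the hyphenated words too, so "In-The" and "in-the" count as the same term.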