
How can I do multiword tokenization in sklearn?

I'm looking at the tokenizers in sklearn, namely CountVectorizer and DictVectorizer. I'd like to be able to debug my token counts before performing TF-IDF. However, I'm encountering difficulty in converting my NLTK multiword tokenizer (MWETokenizer) into scikit-learn.

Currently, I have the following:

from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer()
tokens = ["New York", "Albany", "Buffalo", "Hudson River"]
for t in tokens:
  parts = t.split(" ")
  if len(parts) > 1:
    # Multiword entry: register the word sequence as a multiword expression.
    print(parts)
    tokenizer.add_mwe(parts)
  else:
    # Single word: wrap it in a tuple so it isn't split into characters.
    tokenizer.add_mwe((t,))


# Small corpus
corpus = [
  'This is a new document about New York and the Hudson River.',
  'This is a document about California instead.'
  ]
[tokenizer.tokenize(c.split()) for c in corpus]

And I get:

[['This', 'is', 'a', 'new', 'document', 'about', 'New_York', 'and', 'the', 'Hudson', 'River.'],
 ['This', 'is', 'a', 'document', 'about', 'California', 'instead.']]

That still needs punctuation handling, but it recognizes "New York" as a single token, great.
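As an aside (not part of the original question), one possible way to handle that punctuation is to tokenize with nltk.word_tokenize instead of str.split() before handing the tokens to the MWE tokenizer, for example:

from nltk.tokenize import word_tokenize  # assumes NLTK's "punkt" data has been downloaded

[tokenizer.tokenize(word_tokenize(c)) for c in corpus]
# "River." is now split into "River" and ".", so "Hudson River" can also be
# merged into "Hudson_River"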

Trying to apply something similar with CountVectorizer, I find...

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(vocabulary=tokens, lowercase=False)
# >>> CountVectorizer(vocabulary=['New York', 'Albany', 'Buffalo', 'Hudson River'])

vectorizer.fit_transform(corpus).toarray()
# array([[0, 0, 0, 0],
#        [0, 0, 0, 0]])

which is wrong. How can I get counts of my (multiword) dictionary using CountVectorizer (and ultimately TfidfVectorizer) in sklearn?
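A small debugging sketch (not from the original post) shows why every count is zero: with the default ngram_range=(1, 1), the analyzer that CountVectorizer builds only emits single-word tokens, so multiword vocabulary entries like "New York" can never match.

vectorizer.build_analyzer()(corpus[0])
# roughly: ['This', 'is', 'new', 'document', 'about', 'New', 'York', 'and', 'the', 'Hudson', 'River']
# only unigrams, so "New York" and "Hudson River" never appear as tokens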

You might need to specify the n-grams manually. No idea if this is the right way or not:

from sklearn.feature_extraction.text import CountVectorizer

# Derive the n-gram range from the vocabulary: the shortest and longest entries
# (in words) tell CountVectorizer which word n-grams it has to build.
lengths = [len(t.split()) for t in tokens]
ng_min = max(min(lengths), 1)
ng_max = max(lengths)
vectorizer = CountVectorizer(vocabulary=tokens, lowercase=False, ngram_range=(ng_min, ng_max))
vectorizer.fit_transform(corpus).toarray()

yields:

array([[1, 0, 0, 1],
       [0, 0, 0, 0]])
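An alternative worth sketching (not part of the original answer): CountVectorizer accepts a custom tokenizer callable, so the NLTK MWETokenizer can be plugged in directly and the multiword expressions then surface as single features such as New_York. A hedged example, reusing the corpus from the question:

from nltk.tokenize import MWETokenizer, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

mwe = MWETokenizer([("New", "York"), ("Hudson", "River")])

# lowercase=False keeps "New"/"York" capitalized so the MWEs still match;
# supplying a custom tokenizer also means the default token_pattern is ignored.
vectorizer = CountVectorizer(
  tokenizer=lambda doc: mwe.tokenize(word_tokenize(doc)),
  lowercase=False,
)
X = vectorizer.fit_transform(corpus)
sorted(vectorizer.vocabulary_)
# the learned vocabulary now contains 'New_York' and 'Hudson_River' as single tokens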
