
How can I do multiword tokenization in sklearn?

I'm looking at the tokenizers in sklearn, namely CountVectorizer and DictVectorizer. I'd like to be able to debug my token counts before performing TF-IDF. However, I'm encountering difficulty in converting my NLTK multiword tokenizer (MWETokenizer) into scikit-learn.

Currently, I have the following:

from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer()
tokens = ["New York", "Albany", "Buffalo", "Hudson River"]
for t in tokens:
  parts = t.split(" ")
  if len(parts) > 1:
    # Multiword entry: register the word sequence as a multiword expression.
    print(parts)
    tokenizer.add_mwe(parts)
  else:
    # Single word: wrap it in a tuple so it isn't split into characters.
    tokenizer.add_mwe((t,))


# Small corpus
corpus = [
  'This is a new document about New York and the Hudson River.',
  'This is a document about California instead.'
  ]
[tokenizer.tokenize(c.split()) for c in corpus]

And I get:

[['This', 'is', 'a', 'new', 'document', 'about', 'New_York', 'and', 'the', 'Hudson', 'River.'],
 ['This', 'is', 'a', 'document', 'about', 'California', 'instead.']]

That still needs punctuation handling, but it recognizes "New York" as a single token, great.
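As an aside (not part of the original question), one possible way to handle that punctuation is to tokenize with nltk.word_tokenize instead of str.split() before handing the tokens to the MWE tokenizer, for example:

from nltk.tokenize import word_tokenize  # assumes NLTK's "punkt" data has been downloaded

[tokenizer.tokenize(word_tokenize(c)) for c in corpus]
# "River." is now split into "River" and ".", so "Hudson River" can also be
# merged into "Hudson_River"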

Trying to apply something similar with CountVectorizer, I find...

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(vocabulary=tokens, lowercase=False)
# >>> CountVectorizer(vocabulary=['New York', 'Albany', 'Buffalo', 'Hudson River'])

vectorizer.fit_transform(corpus).toarray()
# array([[0, 0, 0, 0],
#        [0, 0, 0, 0]])

which is wrong. How can I get counts of my (multiword) dictionary using CountVectorizer (and ultimately TfidfVectorizer) in sklearn?
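A small debugging sketch (not from the original post) shows why every count is zero: with the default ngram_range=(1, 1), the analyzer that CountVectorizer builds only emits single-word tokens, so multiword vocabulary entries like "New York" can never match.

vectorizer.build_analyzer()(corpus[0])
# roughly: ['This', 'is', 'new', 'document', 'about', 'New', 'York', 'and', 'the', 'Hudson', 'River']
# only unigrams, so "New York" and "Hudson River" never appear as tokens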

You might need to specify the n-grams manually. No idea if this is the right way or not:

from sklearn.feature_extraction.text import CountVectorizer

# Derive the n-gram range from the vocabulary: the shortest and longest entries
# (in words) tell CountVectorizer which word n-grams it has to build.
lengths = [len(t.split()) for t in tokens]
ng_min = max(min(lengths), 1)
ng_max = max(lengths)
vectorizer = CountVectorizer(vocabulary=tokens, lowercase=False, ngram_range=(ng_min, ng_max))
vectorizer.fit_transform(corpus).toarray()

yields:

array([[1, 0, 0, 1],
       [0, 0, 0, 0]])
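An alternative worth sketching (not part of the original answer): CountVectorizer accepts a custom tokenizer callable, so the NLTK MWETokenizer can be plugged in directly and the multiword expressions then surface as single features such as New_York. A hedged example, reusing the corpus from the question:

from nltk.tokenize import MWETokenizer, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

mwe = MWETokenizer([("New", "York"), ("Hudson", "River")])

# lowercase=False keeps "New"/"York" capitalized so the MWEs still match;
# supplying a custom tokenizer also means the default token_pattern is ignored.
vectorizer = CountVectorizer(
  tokenizer=lambda doc: mwe.tokenize(word_tokenize(doc)),
  lowercase=False,
)
X = vectorizer.fit_transform(corpus)
sorted(vectorizer.vocabulary_)
# the learned vocabulary now contains 'New_York' and 'Hudson_River' as single tokens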
