Preserving the original document element index of the argument passed to sklearn's CountVectorizer() in order to access the corresponding part-of-speech tags

Question

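I have a dataframe with sentences and the corresponding part-of-speech (POS) tag for each word (below is an extract of the data I am working with; the data comes from the SNLI corpus). For each sentence in my collection, I would like to extract the unigrams together with the corresponding POS tag of each word.

For example, if I have the following: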

vectorizer_unigram = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words = 'english')

doc = {'sent' : ['Two women are embracing while holding to go packages .'], 'tags' : ['NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT']}

sentence = vectorizer_unigram.fit(doc['sent'])
sentence_unigrams = sentence.get_feature_names_out()

Then I get the following unigrams as output:

array(['embracing', 'holding', 'packages', 'women'], dtype=object)

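But I don't know how to keep the POS tags after this step. I tried building a lookup from the unigrams, but they may differ from the words in the sentence: for example, if you do sentence.split(' ') you do not necessarily get the same tokens. Any suggestions on how to extract the unigrams and retain the corresponding POS tags?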

Answer 1

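After looking at the source code of the sklearn CountVectorizer class, in particular the fit function, I don't believe the class has any way to track the original document element indices relative to the extracted unigram features, and the unigram features do not necessarily match the original tokens. Apart from the simple solution provided below, you may have to rely on some other method or library to get the result you want. If there is a specific case where this fails, I suggest adding it to your question, as it may help people find a solution to your problem.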
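To illustrate the point about features not matching the original tokens, here is a small sketch. The example sentence is made up for illustration and is not from the SNLI extract. It compares a plain whitespace split with the output of the vectorizer's own analyzer (obtained via build_analyzer()), which lowercases, drops punctuation and single-character tokens, removes stop words, and returns bare strings without positions.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words='english')

# build_analyzer() returns the callable the vectorizer applies to each document:
# preprocessing (lowercasing), tokenization, then stop-word filtering.
analyze = vectorizer.build_analyzer()

# Hypothetical sentence, chosen only to show the mismatch.
sentence = "A man's dog is running ."

whitespace_tokens = sentence.split()
analyzed_tokens = analyze(sentence)

print(whitespace_tokens)
# Output: ['A', "man's", 'dog', 'is', 'running', '.']
print(analyzed_tokens)
# Output: ['man', 'dog', 'running']
```

The analyzer output is a list of strings only, so the position of each unigram in the original sentence has to be recovered separately.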

from sklearn.feature_extraction.text import CountVectorizer

vectorizer_unigram = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words = 'english')

doc = {'sent': ['Two women are embracing while holding to go packages .'],
       'tags': ['NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT']}

sentence = vectorizer_unigram.fit(doc['sent'])
sentence_unigrams = sentence.get_feature_names_out()

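# Split the sentence and the tag string on whitespace so that
# tokens and tags line up by position.
sent_token_list = doc['sent'][0].split()
tags_token_list = doc['tags'][0].split()
sentence_tags = []

# For each unigram, collect the tag of every matching token.
# CountVectorizer lowercases by default, so compare against the lowercased token.
for unigram in sentence_unigrams:
    for token, tag in zip(sent_token_list, tags_token_list):
        if token.lower() == unigram:
            sentence_tags.append(tag)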

print(sentence_unigrams)
# Output: ['embracing' 'holding' 'packages' 'women']
print(sentence_tags)
# Output: ['VERB', 'VERB', 'NOUN', 'NOUN']
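One possible variation on the lookup above, offered as a sketch rather than a definitive fix: run the vectorizer's analyzer on each whitespace token separately, so that every unigram it produces inherits the index and tag of the token it came from. The helper name unigrams_with_tags is hypothetical, and the sketch assumes the tag string holds exactly one tag per whitespace-separated token, as in the extract shown in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer


def unigrams_with_tags(sentence, tags, vectorizer):
    """Return (token_index, unigram, tag) triples for the unigrams the vectorizer keeps.

    Hypothetical helper; assumes one tag per whitespace-separated token.
    """
    analyze = vectorizer.build_analyzer()
    tokens = sentence.split()
    tag_list = tags.split()
    if len(tokens) != len(tag_list):
        raise ValueError("tokens and tags are not aligned one-to-one")

    triples = []
    for index, (token, tag) in enumerate(zip(tokens, tag_list)):
        # The analyzer may return zero unigrams (stop word, punctuation)
        # or several (e.g. a token containing an apostrophe or hyphen).
        for unigram in analyze(token):
            triples.append((index, unigram, tag))
    return triples


vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words='english')
triples = unigrams_with_tags(
    'Two women are embracing while holding to go packages .',
    'NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT',
    vectorizer,
)
print(triples)
# Output: [(1, 'women', 'NOUN'), (3, 'embracing', 'VERB'), (5, 'holding', 'VERB'), (8, 'packages', 'NOUN')]
```

Because the index is carried along, repeated words keep their own tags, and the results come back in sentence order rather than the alphabetical order of get_feature_names_out(). This only works for unigrams; n-grams spanning several tokens would need a different alignment.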

Solution 1
0 2022-11-29 12:25:46
