Python - From list of list of tokens to bag of words

I am struggling with computing a bag of words. I have a pandas DataFrame with a textual column that I tokenize, remove stop words from, and stem. In the end, for each document, I have a list of strings.

My ultimate goal is to compute a bag of words for this column. I've seen that scikit-learn has a function to do that, but it works on strings, not on a list of strings.

I am doing the preprocessing myself with NLTK and would like to keep it that way...
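For reference, a minimal sketch of that kind of NLTK pipeline (the preprocess helper, the sample data, and the column names here are illustrative, not the asker's actual code):

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')       # tokenizer model
nltk.download('stopwords')   # stop word lists

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    # Tokenize, drop stop words and non-alphabetic tokens, then stem
    tokens = nltk.word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

df = pd.DataFrame({'text': ['Hello world', 'Hello hello, stackoverflow!']})
df['tokens'] = df['text'].apply(preprocess)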

Is there a way to compute a bag of words based on a list of lists of tokens? E.g., something like this:

["hello", "world"]
["hello", "stackoverflow", "hello"]

should be converted into

[1, 1, 0]
[2, 0, 1]

with vocabulary:

["hello", "world", "stackoverflow"]

You can create a DataFrame by filtering with Counter and then convert it to lists:

from collections import Counter

import pandas as pd

df = pd.DataFrame({'text':[["hello", "world"],
                           ["hello", "stackoverflow", "hello"]]})

L = ["hello", "world", "stackoverflow"]

# Count only the tokens that appear in the vocabulary L
f = lambda x: Counter([y for y in x if y in L])
df['new'] = (pd.DataFrame(df['text'].apply(f).values.tolist())
               .fillna(0)            # words missing from a document -> count 0
               .astype(int)
               .reindex(columns=L)   # order the columns by the vocabulary
               .values
               .tolist())
print(df)

                            text        new
0                 [hello, world]  [1, 1, 0]
1  [hello, stackoverflow, hello]  [2, 0, 1]
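If the vocabulary should be derived from the data instead of hardcoded, a small variant of the same idea (assuming first-seen token order is acceptable) is:

from collections import Counter

import pandas as pd

df = pd.DataFrame({'text': [["hello", "world"],
                            ["hello", "stackoverflow", "hello"]]})

# Build the vocabulary from the column itself, keeping first-seen order
L = list(dict.fromkeys(tok for doc in df['text'] for tok in doc))

df['new'] = (pd.DataFrame([Counter(doc) for doc in df['text']])
               .fillna(0)
               .astype(int)
               .reindex(columns=L)
               .values
               .tolist())
print(df)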

sklearn.feature_extraction.text.CountVectorizer can help a lot. Here's the example from the official documentation:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
X.toarray()
# array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
#        [0, 1, 0, 1, 0, 2, 1, 0, 1],
#        [1, 0, 0, 0, 1, 0, 1, 1, 0],
#        [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)

You can get the feature names with the method vectorizer.get_feature_names().
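Since the input here is already tokenized, you can also skip CountVectorizer's own tokenization entirely. This sketch relies on the documented behaviour that a callable passed as analyzer is applied to the raw, unprocessed input (here, each token list):

from sklearn.feature_extraction.text import CountVectorizer

docs = [["hello", "world"],
        ["hello", "stackoverflow", "hello"]]

# The callable analyzer receives each raw document (a token list)
# and must return the sequence of tokens to count
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names())  # ['hello', 'stackoverflow', 'world']
print(X.toarray())                     # [[1 0 1]
                                       #  [2 1 0]]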

Using sklearn.feature_extraction.text.CountVectorizer

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'text': [['hello', 'world'], 
                        ['hello', 'stackoverflow', 'hello']]
                   })

# Join tokens into a single space-separated string, as CountVectorizer expects raw text
df['text'] = df['text'].apply(' '.join)

vectorizer = CountVectorizer(lowercase=False)
x = vectorizer.fit_transform(df['text'].values)

print(vectorizer.get_feature_names())
print(x.toarray())

Output:

['hello', 'stackoverflow', 'world']

[[1 0 1]
 [2 1 0]]
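If you then want those counts back in a labeled structure, one possible follow-up (continuing from the snippet above) is:

# Wrap the counts in a DataFrame with one column per vocabulary word
counts = pd.DataFrame(x.toarray(),
                      columns=vectorizer.get_feature_names(),
                      index=df.index)
print(counts)
#    hello  stackoverflow  world
# 0      1              0      1
# 1      2              1      0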
