Python - From a list of lists of tokens to a bag of words
I am struggling to compute a bag of words. I have a pandas DataFrame with a text column that I tokenize, strip of stop words, and stem. In the end, for each document, I have a list of strings.
My ultimate goal is to compute a bag of words for this column. I've seen that scikit-learn has a function for that, but it works on strings, not on lists of strings.
I am doing the preprocessing myself with NLTK and would like to keep it that way.
Is there a way to compute a bag of words from a list of lists of tokens? For example, something like this:
["hello", "world"]
["hello", "stackoverflow", "hello"]
should be converted into
[1, 1, 0]
[2, 0, 1]
with vocabulary:
["hello", "world", "stackoverflow"]
You can create a DataFrame by counting the tokens of each document with Counter, keeping only the vocabulary words, and then convert the rows to lists:
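from collections import Counter
import pandas as pd

df = pd.DataFrame({'text': [["hello", "world"],
                            ["hello", "stackoverflow", "hello"]]})

# Fixed vocabulary; its order defines the column order of the vectors
L = ["hello", "world", "stackoverflow"]

# Count only the tokens that belong to the vocabulary
f = lambda x: Counter([y for y in x if y in L])

# Reindex before filling so a word absent from every document becomes 0, not NaN
df['new'] = (pd.DataFrame(df['text'].apply(f).values.tolist())
               .reindex(columns=L)
               .fillna(0)
               .astype(int)
               .values
               .tolist())
print(df)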
                            text        new
0                 [hello, world]  [1, 1, 0]
1  [hello, stackoverflow, hello]  [2, 0, 1]
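The vocabulary above is hard-coded. If it should instead be derived from the documents, a minimal sketch using only the standard library could look like the following. The names `docs` and `bag_of_words` are illustrative and not part of the original answer, and the first-seen ordering of the vocabulary is an assumption chosen to match the ordering in the question.

```python
from collections import Counter

docs = [["hello", "world"], ["hello", "stackoverflow", "hello"]]

# Build the vocabulary in first-seen order.
# dict keys preserve insertion order in Python 3.7+.
vocabulary = list(dict.fromkeys(token for doc in docs for token in doc))


def bag_of_words(tokens, vocabulary):
    """Return the count of each vocabulary term in one tokenized document."""
    counts = Counter(tokens)
    # Counter returns 0 for terms that do not occur in the document.
    return [counts[term] for term in vocabulary]


vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)
print(vectors)
```

This produces dense Python lists, which is fine for small data; for a large corpus a sparse representation such as the one scikit-learn returns is likely a better fit.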
sklearn.feature_extraction.text.CountVectorizer can help a lot. Here is the example from the official documentation:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix
X.toarray()
# array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
#        [0, 1, 0, 1, 0, 2, 1, 0, 1],
#        [1, 0, 0, 0, 1, 0, 1, 1, 0],
#        [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
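You can get the feature names with the method vectorizer.get_feature_names(). In scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead, since the older method was deprecated in 1.0 and removed in 1.2.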
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'text': [['hello', 'world'],
                            ['hello', 'stackoverflow', 'hello']]})

# Join each token list into a single string, as CountVectorizer expects strings
df['text'] = df['text'].apply(' '.join)

# lowercase=False keeps the tokens exactly as they were preprocessed
vectorizer = CountVectorizer(lowercase=False)
x = vectorizer.fit_transform(df['text'].values)

# On scikit-learn < 1.0, use vectorizer.get_feature_names() instead
print(vectorizer.get_feature_names_out().tolist())
print(x.toarray())
Output:
['hello', 'stackoverflow', 'world']
[[1 0 1]
 [2 1 0]]
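Joining the tokens back into a string means CountVectorizer tokenizes them again with its default pattern, which drops single-character tokens and splits on punctuation. One way to avoid that, sketched here under the assumption that the documents are already lists of tokens, is to pass a callable as the `analyzer` parameter so the lists are used as-is. The helper name `identity` is hypothetical and not part of any answer above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [["hello", "world"], ["hello", "stackoverflow", "hello"]]


def identity(tokens):
    """Return the pre-tokenized document unchanged."""
    return tokens


# A callable analyzer replaces the built-in preprocessing and tokenization,
# so each list of tokens is counted exactly as given.
vectorizer = CountVectorizer(analyzer=identity)
counts = vectorizer.fit_transform(docs).toarray().tolist()

# vocabulary_ maps each term to its column index; sort terms by that index.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocabulary)
print(counts)
```

Note that the columns follow the vectorizer's own alphabetical vocabulary order, so they will not necessarily match the order given in the question.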