Word frequencies in Pandas list
I have a column of tokenized, lemmatized text in a pandas df. I'm trying to create a matrix of word frequencies so that I can then go on to dimensionality reduction.
I keep running into an error that Python was expecting a string but got a list.
TypeError: sequence item 0: expected str instance, list found
I've tried a handful of ways, but run into errors each time. I'm not sure how to account for a list.
Here are a few of the methods I've tried:
Option 1:
from collections import Counter
df['new_col'] = Counter()
for token in df['col']:
    counts[token.orth_] += 1
This generated:
ValueError: Length of values does not match length of index
Option 2:
Counter(' '.join(df['col']).split()).most_common()
Which generated:
TypeError: sequence item 0: expected str instance, list found
Option 3:
pd.Series(values = ','.join([(i) for i in df['col']]).lower().split()).value_counts()[:]
Which again generated:
TypeError: sequence item 0: expected str instance, list found
Edit: Sample data:
col
[indicate, after, each, action, step, .]
[during, september, and, october, please, refrain]
[the, work, will, be, ongoing, throughout, the]
[professional, development, session, will, be]
Given what you've told us, the best solution, as 9dogs mentioned, is to use scikit-learn's CountVectorizer. I'm making some assumptions here on what format you'd like the data in, but here's what will get you a doc x token dataframe where the values are the counts of the tokens within a document. It assumes that df['col'] is a pandas series where values are lists.
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(analyzer=lambda x: x)
>>> counted_values = cv.fit_transform(df['col']).toarray()
>>> df = pd.DataFrame(counted_values, columns=cv.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0
>>> df.iloc[0:5, 0:5]
   .  action  after  and  be
0  1       1      1    0   0
1  0       0      0    1   0
2  0       0      0    0   1
3  0       0      0    0   1
CountVectorizer can tokenize for you, and will by default, so we pass an identity lambda function to the analyzer argument to tell it that our documents are pre-tokenized.
I wouldn't recommend this, but I think it's helpful if you want to understand how Counters work. Since your values are a list, you can use .apply on each row of your series.
>>> counted_values = df['col'].apply(lambda x: Counter(x))
>>> counted_values
0 {'.': 1, 'after': 1, 'indicate': 1, 'action': ...
1 {'during': 1, 'and': 1, 'october': 1, 'please'...
2 {'will': 1, 'ongoing': 1, 'work': 1, 'the': 2,...
3 {'development': 1, 'professional': 1, 'session...
dtype: object
So now you have a series of dicts, which isn't very helpful. You could convert this to a dataframe similar to what we have above with the following:
>>> suboptimal_df = pd.DataFrame(counted_values.tolist())
>>> suboptimal_df.iloc[0:5, 0:5]
     .  action  after  and   be
0  1.0     1.0    1.0  NaN  NaN
1  NaN     NaN    NaN  1.0  NaN
2  NaN     NaN    NaN  NaN  1.0
3  NaN     NaN    NaN  NaN  1.0
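If you do go this route, the NaN gaps can be normalized so the result matches the CountVectorizer output. A small sketch (the two rows here are hypothetical, chosen just to produce a missing token):

```python
import pandas as pd
from collections import Counter

# A series whose values are token lists, as in the question
s = pd.Series([['the', 'work', 'the'], ['be', 'ongoing']])
counted_values = s.apply(Counter)

# Tokens absent from a row become NaN; fill with 0 and cast back to int
dense = pd.DataFrame(counted_values.tolist()).fillna(0).astype(int)
print(dense.loc[0, 'the'])  # 'the' appears twice in the first row
```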
I wouldn't recommend this because apply is slow, plus it's already a little goofy that we're storing lists as Series values; dicts are equally goofy. DataFrames do best as structured containers of numeric or string values (think spreadsheets), not as containers of other container types.
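Since the original goal was word frequencies, one last Counter-based sketch: if you only need corpus-wide totals rather than a doc x token matrix, you can update a single Counter per row and skip the dataframe entirely (the sample rows below are from the question; the column name 'col' is assumed):

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({'col': [
    ['during', 'september', 'and', 'october', 'please', 'refrain'],
    ['the', 'work', 'will', 'be', 'ongoing', 'throughout', 'the'],
]})

# Update one Counter in place per row; cheaper than summing Counters pairwise
totals = Counter()
for tokens in df['col']:
    totals.update(tokens)

print(totals.most_common(1))  # [('the', 2)] -- the only repeated token here
```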