
Word frequencies in Pandas list

I have a column of tokenized, lemmatized text in a pandas df. I'm trying to create a matrix of word frequencies so that I can then go on to dimensionality reduction.

I keep running into an error that Python was expecting a string but got a list: TypeError: sequence item 0: expected str instance, list found

I've tried a handful of ways, but run into errors each time. I'm not sure how to account for a list.

Here are a few of the methods I've tried:

Option 1:

from collections import Counter
df['new_col'] = Counter()
for token in df['col']:
    counts[token.orth_] += 1

This generated ValueError: Length of values does not match length of index

Option 2:

Counter(' '.join(df['col']).split()).most_common()

Which generated: TypeError: sequence item 0: expected str instance, list found

Option 3:

pd.Series(values = ','.join([(i) for i in df['col']]).lower().split()).value_counts()[:]

Which again generated: TypeError: sequence item 0: expected str instance, list found

Edit: Sample data:

col
[indicate, after, each, action, step, .]
[during, september, and, october, please, refrain]
[the, work, will, be, ongoing, throughout, the]
[professional, development, session, will, be]
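
For reference, here is a minimal sketch that builds a DataFrame in this shape (a reconstruction of the sample above, assuming the column is named col and each cell holds a list of tokens):

import pandas as pd

# Each cell of 'col' holds a list of already-tokenized words.
df = pd.DataFrame({
    "col": [
        ["indicate", "after", "each", "action", "step", "."],
        ["during", "september", "and", "october", "please", "refrain"],
        ["the", "work", "will", "be", "ongoing", "throughout", "the"],
        ["professional", "development", "session", "will", "be"],
    ]
})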

Easy Answer

Given what you've told us, the best solution, as 9dogs mentioned, is to use scikit-learn's CountVectorizer. I'm making some assumptions here about what format you'd like the data in, but the following will get you a doc x token dataframe where the values are the counts of the tokens within a document. It assumes that df['col'] is a pandas Series whose values are lists.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(analyzer=lambda x: x)
>>> counted_values = cv.fit_transform(df['col']).toarray()
>>> df = pd.DataFrame(counted_values, columns=cv.get_feature_names())
>>> df.iloc[0:5, 0:5]
   .  action  after  and  be
0  1       1      1    0   0
1  0       0      0    1   0
2  0       0      0    0   1
3  0       0      0    0   1

CountVectorizer can tokenize for you, and will by default, so we pass an identity lambda function to the analyzer argument to tell it that our documents are pre-tokenized.
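
For contrast, here is a minimal sketch of what the call looks like when the column holds raw, untokenized strings and you let CountVectorizer do its default tokenization (the column contents below are hypothetical; note that the default token pattern keeps only tokens of two or more word characters, so punctuation like "." is dropped, and that newer scikit-learn versions replace get_feature_names() with get_feature_names_out()):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical column of raw strings rather than token lists.
raw = pd.Series([
    "indicate after each action step .",
    "during september and october please refrain",
])

# No custom analyzer: CountVectorizer lowercases and tokenizes on its own.
cv = CountVectorizer()
counted = cv.fit_transform(raw).toarray()

# get_feature_names_out() is the newer spelling of get_feature_names().
doc_term = pd.DataFrame(counted, columns=cv.get_feature_names_out())
print(doc_term)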

Suboptimal Answer

I wouldn't recommend this, but I think it's helpful if you want to understand how Counters work. Since your values are lists, you can use .apply on each row of your series.

>>> counted_values = df['col'].apply(lambda x: Counter(x))
>>> counted_values
0    {'.': 1, 'after': 1, 'indicate': 1, 'action': ...
1    {'during': 1, 'and': 1, 'october': 1, 'please'...
2    {'will': 1, 'ongoing': 1, 'work': 1, 'the': 2,...
3    {'development': 1, 'professional': 1, 'session...
dtype: object

So now you have a series of dicts, which isn't very helpful. You could convert this to a dataframe similar to what we have above with the following:

>>> suboptimal_df = pd.DataFrame(counted_values.tolist())
>>> suboptimal_df.iloc[0:5, 0:5]
     .  action  after  and   be
0  1.0     1.0    1.0  NaN  NaN
1  NaN     NaN    NaN  1.0  NaN
2  NaN     NaN    NaN  NaN  1.0
3  NaN     NaN    NaN  NaN  1.0

I wouldn't recommend this because apply is slow, plus it's already a little goofy that we're storing lists as Series values, and dicts are equally goofy. DataFrames do best as structured containers of numeric or string values (think spreadsheets), not of other container types.
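
If you do go this route, a small follow-up sketch (continuing from suboptimal_df above) turns the NaNs into integer zero counts so the result matches the doc x token matrix from CountVectorizer:

# Missing tokens appear as NaN; fill them with 0 and cast back to int
# so every cell is a plain count.
suboptimal_df = suboptimal_df.fillna(0).astype(int)
print(suboptimal_df.iloc[0:5, 0:5])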
