
How to create a bag of words from a pandas dataframe

Here's my dataframe:

    CATEGORY    BRAND
0   Noodle  Anak Mas
1   Noodle  Anak Mas
2   Noodle  Indomie
3   Noodle  Indomie
4   Noodle  Indomie
23  Noodle  Indomie
24  Noodle  Mi Telor Cap 3
25  Noodle  Mi Telor Cap 3
26  Noodle  Pop Mie
27  Noodle  Pop Mie
...

I have already made sure that the df columns are strings. My code is:

df = data[['CATEGORY', 'BRAND']].astype(str)
import collections, re
texts = df
bagsofwords = [ collections.Counter(re.findall(r'\w+', txt))
            for txt in texts]
sumbags = sum(bagsofwords, collections.Counter())

When I call

sumbags

The output is

 Counter({'BRAND': 1, 'CATEGORY': 1})

I want sumbags to count all of the data, excluding the column titles. To make it clear, something like:

Counter({'Noodle': 10, 'Indomie': 4, 'Anak': 2, ....}) # because it is bag of words

I need the count of every single word.

If I understand correctly, use one of these:

Option 1] NumPy flatten and split

In [2535]: collections.Counter([y for x in df.values.flatten() for y in x.split()])
Out[2535]:
Counter({'3': 2,
         'Anak': 2,
         'Cap': 2,
         'Indomie': 4,
         'Mas': 2,
         'Mi': 2,
         'Mie': 2,
         'Noodle': 10,
         'Pop': 2,
         'Telor': 2})
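
To try Option 1 outside the IPython session, here is a minimal self-contained sketch. The frame is rebuilt by hand from the ten rows printed in the question, so its construction is an assumption rather than the asker's real `data`; the counting line itself is the same idea as above.

```python
import collections

import pandas as pd

# Sample frame rebuilt from the rows shown in the question (an assumption:
# the real `data` object is not available, only its printed rows).
df = pd.DataFrame(
    {
        "CATEGORY": ["Noodle"] * 10,
        "BRAND": ["Anak Mas"] * 2
        + ["Indomie"] * 4
        + ["Mi Telor Cap 3"] * 2
        + ["Pop Mie"] * 2,
    },
    index=[0, 1, 2, 3, 4, 23, 24, 25, 26, 27],
).astype(str)

# Flatten all cells into one 1-D array, then split each cell on whitespace.
words = [word for cell in df.values.flatten() for word in cell.split()]
sumbags = collections.Counter(words)

print(sumbags.most_common(2))  # [('Noodle', 10), ('Indomie', 4)]
```

Note that `str.split()` splits on whitespace only, so punctuation would stay attached to words; for this data that makes no difference.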

Option 2] Use value_counts()

In [2536]: pd.Series([y for x in df.values.flatten() for y in x.split()]).value_counts()
Out[2536]:
Noodle     10
Indomie     4
Mie         2
Pop         2
Anak        2
Mi          2
Cap         2
Telor       2
Mas         2
3           2
dtype: int64

Option 3] Use stack and value_counts

In [2582]: df.apply(lambda x: x.str.split(expand=True).stack()).stack().value_counts()
Out[2582]:
Noodle     10
Indomie     4
Mie         2
Pop         2
Anak        2
Mi          2
Cap         2
Telor       2
Mas         2
3           2
dtype: int64

Details

In [2516]: df
Out[2516]:
   CATEGORY           BRAND
0    Noodle        Anak Mas
1    Noodle        Anak Mas
2    Noodle         Indomie
3    Noodle         Indomie
4    Noodle         Indomie
23   Noodle         Indomie
24   Noodle  Mi Telor Cap 3
25   Noodle  Mi Telor Cap 3
26   Noodle         Pop Mie
27   Noodle         Pop Mie
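
As for why the original loop returned only Counter({'BRAND': 1, 'CATEGORY': 1}): iterating over a DataFrame yields its column labels, not its cells. The sketch below shows that behaviour and one way to keep the question's regex tokenizer while iterating over the cell values instead. The sample frame is rebuilt by hand from the rows above, which is an assumption about the real data.

```python
import collections
import re

import pandas as pd

# Sample frame rebuilt from the rows shown above (an assumption).
df = pd.DataFrame(
    {
        "CATEGORY": ["Noodle"] * 10,
        "BRAND": ["Anak Mas"] * 2
        + ["Indomie"] * 4
        + ["Mi Telor Cap 3"] * 2
        + ["Pop Mie"] * 2,
    },
    index=[0, 1, 2, 3, 4, 23, 24, 25, 26, 27],
).astype(str)

# `for txt in df` walks the column labels, so the original loop
# only ever tokenized the strings 'CATEGORY' and 'BRAND'.
print(list(df))  # ['CATEGORY', 'BRAND']

# Iterate over the flattened cell values instead, keeping the same regex.
bagsofwords = [
    collections.Counter(re.findall(r"\w+", txt)) for txt in df.values.ravel()
]
sumbags = sum(bagsofwords, collections.Counter())

print(sumbags["Noodle"], sumbags["Anak"])  # 10 2
```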

