
Convert pandas DataFrame group of values to multiple lists

I have a pandas DataFrame in which I listed items and categorised them:

col_name    |col_group
-------------------------
id          | Metadata
listing_url | Metadata
scrape_id   | Metadata
name        | Text
summary     | Text
space       | Text

To reproduce:

import pandas

df = pandas.DataFrame([
    ['id','metadata'],
    ['listing_url','metadata'],
    ['scrape_id','metadata'],
    ['name','Text'],
    ['summary','Text'],
    ['space','Text']],
    columns=['col_name', 'col_group'])

Can you suggest how I can convert this DataFrame into multiple lists based on "col_group":

Metadata = ['id', 'listing_url', 'scrape_id']
Text = ['name','summary','space']

This is so that I can pass these lists of column names to pandas and drop those columns.

I googled a lot and got stuck: all the answers are about converting lists to a DataFrame, not the other way round. Should I aim to convert into a dictionary, or a list of lists?

I have over 100 rows belonging to 10 categories, so I would like to avoid hard-coding the lists by hand.

Like this:

In [245]: res = df.groupby('col_group')['col_name'].apply(list)

In [248]: res.tolist()
Out[248]: [['name', 'summary', 'space'], ['id', 'listing_url', 'scrape_id']]
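A list of lists loses the group labels, so it may help to keep the index next to it. Here is a minimal, self-contained sketch of the same idea using the sample frame from the question; the names group_names and group_lists are only illustrative. Note that groupby sorts the group keys by default, so 'Text' comes before the lowercase 'metadata'.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["id", "metadata"],
        ["listing_url", "metadata"],
        ["scrape_id", "metadata"],
        ["name", "Text"],
        ["summary", "Text"],
        ["space", "Text"],
    ],
    columns=["col_name", "col_group"],
)

# One list of column names per group; the result is a Series indexed by col_group.
res = df.groupby("col_group")["col_name"].apply(list)

# The index holds the group labels, in the same order as the lists.
group_names = res.index.tolist()
group_lists = res.tolist()

for name, cols in zip(group_names, group_lists):
    print(name, cols)
```

If the order of first appearance matters more than sorted order, passing sort=False to groupby should preserve it.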

I tried this code:

import pandas

df = pandas.DataFrame([
    [1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a'],
    [2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b'],
    [3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']],
    columns=['id', 'listing_url', 'scrape_id', 'name', 'summary', 'space'])
print(df)

for row in df.iterrows():
    print(row[1].to_list())

which gives this output:

[1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a']
[2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b']
[3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']

You can use

for row in df[['name', 'summary', 'space']].iterrows():

to iterate over specific columns only.
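For completeness, here is a small runnable sketch of that column-restricted loop, reusing the sample data from this answer. The variable names text_cols and rows are just for illustration. It also shows a loop-free variant, on the assumption that all you want is the rows as plain Python lists.

```python
import pandas as pd

df = pd.DataFrame(
    [
        [1, "url_a", "scrap_a", "name_a", "summary_a", "space_a"],
        [2, "url_b", "scrap_b", "name_b", "summary_b", "space_b"],
        [3, "url_c", "scrap_c", "name_c", "summary_c", "space_ac"],
    ],
    columns=["id", "listing_url", "scrape_id", "name", "summary", "space"],
)

text_cols = ["name", "summary", "space"]

# iterrows() yields (index, row) pairs; the index is not needed here.
for _, row in df[text_cols].iterrows():
    print(row.to_list())

# Same rows without an explicit loop: one inner list per row.
rows = df[text_cols].to_numpy().tolist()
```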

my_vars = df.groupby('col_group').agg(list)['col_name'].to_dict()

Output:

>>> my_vars
{'Text': ['name', 'summary', 'space'], 'metadata': ['id', 'listing_url', 'scrape_id']}
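Since the stated goal is to drop columns by group, here is a hedged sketch of how such a dictionary could be used for that. The frames catalog and data are made up for the example (data is a one-row stand-in for the real listings table), and DataFrame.drop(columns=...) is the only extra pandas call involved.

```python
import pandas as pd

# The lookup table from the question: column name -> group.
catalog = pd.DataFrame(
    [
        ["id", "metadata"],
        ["listing_url", "metadata"],
        ["scrape_id", "metadata"],
        ["name", "Text"],
        ["summary", "Text"],
        ["space", "Text"],
    ],
    columns=["col_name", "col_group"],
)

# {group: [column names]}
my_vars = catalog.groupby("col_group")["col_name"].agg(list).to_dict()

# Hypothetical data frame whose columns are the names listed in the catalog.
data = pd.DataFrame(
    [[1, "url_a", "scrap_a", "name_a", "summary_a", "space_a"]],
    columns=["id", "listing_url", "scrape_id", "name", "summary", "space"],
)

# Drop every column that belongs to the 'Text' group.
without_text = data.drop(columns=my_vars["Text"])
print(list(without_text.columns))
```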

The recommended usage is simply my_vars['Text'] to access the Text list, and likewise for the other groups. If you must have them as distinct variable names, you can force them into your target scope, e.g. globals:

globals().update(df.groupby('col_group').agg(list)['col_name'].to_dict())

Result:

>>> Text
['name', 'summary', 'space']
>>> metadata
['id', 'listing_url', 'scrape_id']

However, I would advise against that, as you might unwittingly overwrite some of your other objects, or the names might not end up in the scope you need (e.g. locals).
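If attribute-style access is the main attraction of separate names, one possible middle ground (my own suggestion, not something pandas provides) is to wrap the dictionary in a types.SimpleNamespace from the standard library. This is a sketch that assumes every group label is a valid Python identifier; the name groups is arbitrary.

```python
from types import SimpleNamespace

import pandas as pd

df = pd.DataFrame(
    [
        ["id", "metadata"],
        ["listing_url", "metadata"],
        ["scrape_id", "metadata"],
        ["name", "Text"],
        ["summary", "Text"],
        ["space", "Text"],
    ],
    columns=["col_name", "col_group"],
)

my_vars = df.groupby("col_group")["col_name"].agg(list).to_dict()

# Each group becomes an attribute on one object, so nothing in the
# module's global namespace can be overwritten by accident.
groups = SimpleNamespace(**my_vars)

print(groups.Text)
print(groups.metadata)
```

Group labels containing spaces or other non-identifier characters would still need the plain dictionary lookup.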
