Pandas - Add a flag column in dataframe

I have a dataframe like:

Client_ID    Product_nb   Item_id
1            1            i1  
1            1            i2
1            1            i3

1            2            i2
1            2            i5  
1            2            i7

1            3            i1
1            3            i2
1            3            i4
1            3            i6

2            1            i1
2            1            i2
2            1            i3
2            1            i4

2            2            i1
2            2            i2
...          ...          ...

So each client ( Client_ID ) has several products ( Product_nb ). For each product, I want to keep only one item ( Item_id ). And for the same client, the item kept for a product must not be one that was already kept for a previous product.

I want to add a flag next to each item indicating whether I need to keep the item or not:

Client_ID    Product_nb   Item_id   Keep
1            1            i1        1
1            1            i2        0
1            1            i3        0

1            2            i2        1
1            2            i5        0
1            2            i7        0

1            3            i1        0
1            3            i2        0
1            3            i4        1
1            3            i6        0

2            1            i1        1
2            1            i2        0
2            1            i3        0
2            1            i4        0

2            2            i1        0
2            2            i2        1
...          ...          ...       ...

My idea for this was to iterate over all clients and products. For each client, save the items that have already been kept in a list:

df = df.set_index(['client_id','product_nb','item_id','keep'])
client_ids = df.index.get_level_values('client_id').unique()
for client in client_ids:
    list_already = []
    prod_nbs = df.loc[client].index.get_level_values('product_nb').unique()
    for prod_nb in prod_nbs:
        item_ids = df.loc[client,prod_nb].index.get_level_values('item_id').unique()
        for item_id in item_ids:
            if (item_id in list_already):
                df.loc[client,prod_nb,item_id,'keep'] = 1
                continue
            else:
                list_already.append(item_id)
                df.loc[client,prod_nb,item_id,'keep'] = 1
                break

But this just returns the input dataframe unchanged.

I'd be grateful for any sort of help. Thank you.

In pandas you usually don't want to loop over your DataFrame. It is slow, and there are optimized built-in routines for almost anything. In your case

df.groupby(['Client_ID', 'Product_nb'])['Item_id'].first()

does the job. Replace df with the name of your DataFrame.
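As a rough sketch of how that one-liner could be turned into a flag column: the small sample frame and the use of cumcount() below are assumptions for illustration, not part of the original answer. Note that this only marks the first item of each product and does not yet prevent the same item from being kept twice for one client.

```python
import pandas as pd

# Sample data assumed for illustration (client 1, products 1 and 2 from the question)
df = pd.DataFrame({
    "Client_ID":  [1, 1, 1, 1, 1, 1],
    "Product_nb": [1, 1, 1, 2, 2, 2],
    "Item_id":    ["i1", "i2", "i3", "i2", "i5", "i7"],
})

# First item of every (client, product) group, as in the one-liner above
first_items = df.groupby(["Client_ID", "Product_nb"])["Item_id"].first()

# cumcount() numbers the rows inside each group starting at 0,
# so the row numbered 0 is the first item of its product
is_first = df.groupby(["Client_ID", "Product_nb"]).cumcount() == 0
df["Keep"] = is_first.astype(int)
```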

Edit: I overlooked the constraint that the chosen item should be unique for each client. It would probably be best to filter the values beforehand and group afterwards.
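One possible way to honor the uniqueness constraint is sketched below. It is not a pure filter-then-groupby solution: because the choice for each product depends on what was kept for earlier products, it uses a small greedy loop per client instead. The helper name flag_first_unused is hypothetical, and the sketch assumes rows are already ordered by product within each client; if every item of a product was already kept earlier, no row of that product gets flagged.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Client_ID":  [1] * 10 + [2] * 6,
    "Product_nb": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2],
    "Item_id":    ["i1", "i2", "i3", "i2", "i5", "i7", "i1", "i2", "i4", "i6",
                   "i1", "i2", "i3", "i4", "i1", "i2"],
})

def flag_first_unused(client_rows):
    """Hypothetical helper: mark, per product, the first item not kept earlier."""
    used = set()                                  # items already kept for this client
    keep = pd.Series(0, index=client_rows.index)  # default flag is 0
    for _, product_rows in client_rows.groupby("Product_nb", sort=False):
        for idx, item in product_rows["Item_id"].items():
            if item not in used:
                used.add(item)
                keep.loc[idx] = 1
                break                             # only one item per product
    return keep

df["Keep"] = 0
for _, client_rows in df.groupby("Client_ID", sort=False):
    # Assignment aligns on the original row index
    df.loc[client_rows.index, "Keep"] = flag_first_unused(client_rows)
```

On the sample data this reproduces the Keep column shown in the question. The loop still runs in Python, but only over clients and their items rather than through repeated multi-level index lookups.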
