How to replace a value depending on “identifier columns” and an additional condition in a pandas dataframe?

As part of some data cleaning, I need to 'align' the values in 'Column A' (ColA in the example) for each 'Year' and 'ID' combination: if any row in a combination has Column A = 1, every row in that combination should be set to 1.

I already tried np.where(), but only got ValueError: Can only compare identically-labeled Series objects

Here is a short example DataFrame:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2007, 0], 
                       [2, 2008, 0], 
                       [2, 2009, 1], 
                       [3, 2007, 0], 
                       [4, 2010, 0], 
                       [4, 2011, 1], 
                       [4, 2011, 0]]),  # I want to change this 0 to 1
             columns=['ID', 'Year', 'ColA'])

The result should look like this:

result = pd.DataFrame(np.array([[1, 2007, 0], 
                       [2, 2008, 0], 
                       [2, 2009, 1], 
                       [3, 2007, 0], 
                       [4, 2010, 0], 
                       [4, 2011, 1], 
                       [4, 2011, 1]]),
             columns=['ID', 'Year', 'ColA'])

We can use groupby.transform with any. For every row, this returns a boolean that is True when at least one value in the row's (ID, Year) group is nonzero, so converting it to int with astype gives the desired result:

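# Pass the string 'any' rather than the builtin any; recent pandas versions
# warn when a Python builtin is passed to transform.
m = df.groupby(['ID', 'Year'])['ColA'].transform('any').astype(int)
df['ColA'] = m
print(df)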
   ID  Year  ColA
0   1  2007     0
1   2  2008     0
2   2  2009     1
3   3  2007     0
4   4  2010     0
5   4  2011     1
6   4  2011     1
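
As a hedged alternative to the approach above, and assuming ColA only ever holds 0 and 1, the group maximum gives the same result without the boolean round trip. The sketch below also shows one way np.where could be made to work: the mask is built with transform, so it keeps the original row index. The names has_one and aligned are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2007, 0],
                            [2, 2008, 0],
                            [2, 2009, 1],
                            [3, 2007, 0],
                            [4, 2010, 0],
                            [4, 2011, 1],
                            [4, 2011, 0]]),
                  columns=['ID', 'Year', 'ColA'])

# Variant with np.where: build a per-row boolean mask that is True when
# any row in the same (ID, Year) group has ColA == 1. transform keeps the
# original index, so the mask lines up with df row for row.
has_one = df['ColA'].eq(1).groupby([df['ID'], df['Year']]).transform('any')
aligned = np.where(has_one, 1, df['ColA'])

# Variant with max (assumes ColA is strictly 0/1): the group maximum is 1
# exactly when at least one row in the group is 1.
df['ColA'] = df.groupby(['ID', 'Year'])['ColA'].transform('max')
print(df)
```

The np.where variant leaves values other than 1 untouched in groups without a 1, which may matter if ColA can hold other values; the max variant is shorter but relies on the 0/1 assumption.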
