
Python / Pandas : fill missing values in specific rows and columns

I'm quite experienced in R and am now learning Python by 'translating' an existing series of scripts from R to Python (df is a pandas DataFrame). I'm stuck at this line:

df[df$id != df$id_old, c("col1", "col2")] <- NA

I.e., I'm trying to set specific rows and columns to NA. I've been trying different things; the most promising route seemed to be:

index = np.where(df.id != df.id_old)
df.col1[index] = np.repeat(np.nan, np.size(index))

But this throws the following error at the second line, which I don't fully understand:

Can only tuple-index with a MultiIndex

What would be the cleanest way to achieve my objective?

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id' : [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5],
    'id_old' : [1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 5],
    'col1' : np.random.normal(size = 12),
    'col2' : np.random.randint(low = 20, high = 50, size = 12),
    'col3' : np.repeat('other info', 12)})

Output:

   id  id_old      col1  col2        col3
0    1       1  0.320982    31  other info
1    1       1  0.398855    42  other info
2    1       2 -0.664073    30  other info
3    2       2  1.428694    48  other info
4    2       3 -1.240363    49  other info
5    3       4  0.023167    42  other info
6    4       4 -0.645114    44  other info
7    4       4 -1.033602    47  other info
8    4       4  0.295143    27  other info
9    4       5  0.531660    32  other info
10   5       5 -0.787401    33  other info
11   5       5  2.033503    48  other info

Expected result:

   id  id_old      col1  col2        col3
0    1       1  0.320982    31  other info
1    1       1  0.398855    42  other info
2    1       2       NaN   NaN  other info
3    2       2  1.428694    48  other info
4    2       3       NaN   NaN  other info
5    3       4       NaN   NaN  other info
6    4       4 -0.645114    44  other info
7    4       4 -1.033602    47  other info
8    4       4  0.295143    27  other info
9    4       5       NaN   NaN  other info
10   5       5 -0.787401    33  other info
11   5       5  2.033503    48  other info

Use .loc with a boolean row mask, and pass a list of column names where in R you would use c(...).

Assigning through .loc modifies the DataFrame in place.


df.loc[df.id!=df.id_old, ['col1', 'col2']] = np.nan


        col1  col2        col3  id  id_old
0   2.411473  31.0  other info   1       1
1   0.874083  43.0  other info   1       1
2        NaN   NaN  other info   1       2
3   2.156903  20.0  other info   2       2
4        NaN   NaN  other info   2       3
5        NaN   NaN  other info   3       4
6   0.933760  22.0  other info   4       4
7  -1.239806  42.0  other info   4       4
8  -0.493344  41.0  other info   4       4
9        NaN   NaN  other info   4       5
10 -0.751290  30.0  other info   5       5
11  0.327527  31.0  other info   5       5
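
For reference, here is a minimal self-contained sketch of the same approach. It also shows the likely cause of the error in the question: np.where called with a single argument returns a tuple of arrays, and that tuple was then used as an index. The col1 and col2 values below are deterministic placeholders rather than the random data from the question, and the names mask and positions are only illustrative.

```python
import numpy as np
import pandas as pd

# Same ids as in the question; col1/col2 use placeholder values
# instead of random numbers so the result is reproducible.
df = pd.DataFrame({
    'id':     [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5],
    'id_old': [1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 5],
    'col1':   np.arange(12, dtype=float),
    'col2':   np.arange(20, 32),
    'col3':   np.repeat('other info', 12),
})

# np.where(condition) returns a tuple with one array per dimension,
# so `index` is a tuple, not an array of row positions.
index = np.where(df.id != df.id_old)
positions = index[0]  # the actual positional row numbers

# Boolean mask of the rows to blank out; no np.where needed.
mask = df['id'] != df['id_old']

# Rows selected by the mask, columns selected by the list.
df.loc[mask, ['col1', 'col2']] = np.nan

print(df)
```

As the output above shows, col2 is displayed as floats after the assignment, since a NumPy integer column cannot hold NaN.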

