
Prevent pandas DataFrame.combine from converting dtypes

Undesired behavior: DataFrame.combine turns ints into floats.

Description: My DataFrame is indexed by filename and holds some metadata about each file:

            pags  rating  tms  glk
name                              
file1  original0       1    1    1
file2  original1       2    2    2
file3  original2       3    3    3
file4  original3       4    4    4
file5  original4       5    5    5

Sometimes I need to update some of the columns for some of the files, leaving all other cells unchanged.
The update can also contain new files, which I need to append as new rows (probably with some missing values).
The update comes in the form of another DataFrame, upd:

       pags  rating
name               
file4  new0      11
file5  new1      12
file6  new2      13
file7  new3      14

Here, I want to change pags and rating for file4 and file5, and append new rows for file6 and file7.
I found that I can do this with DataFrame.combine:

df = df.combine(upd, lambda old,new: new.fillna(old), overwrite=False)[df.columns]
            pags  rating  tms  glk
name                              
file1  original0     1.0  1.0  1.0
file2  original1     2.0  2.0  2.0
file3  original2     3.0  3.0  3.0
file4       new0    11.0  4.0  4.0
file5       new1    12.0  5.0  5.0
file6       new2    13.0  NaN  NaN
file7       new3    14.0  NaN  NaN

The only problem is that all integer columns were converted to floats.
How do I keep the original dtypes?
I would strongly prefer to avoid a manual .astype() call for every column.
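
For context, the float conversion most likely comes from index alignment rather than from the lambda itself: the combined result contains rows that one of the frames lacks, and a NumPy-backed integer column cannot hold NaN. The following is a minimal sketch of that effect on a single Series; the variable names are illustrative and not part of the original code.

```python
import pandas as pd

# A plain NumPy-backed integer Series has no way to store a missing value.
ints = pd.Series([1, 2], index=["file1", "file2"])

# Aligning to a larger index adds a NaN row, so pandas upcasts to float64.
aligned = ints.reindex(["file1", "file2", "file3"])
print(aligned.dtype)   # float64

# The nullable "Int64" extension dtype stores missing values as <NA>,
# so the column stays an integer column after the same alignment.
nullable = ints.astype("Int64").reindex(["file1", "file2", "file3"])
print(nullable.dtype)  # Int64
```

The same upcast applies to tms and glk in the example above, because file6 and file7 have no values for them.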

Code to create this example:

import pandas as pd

df = pd.DataFrame({
    'name': ['file1','file2','file3','file4','file5'],
    'pags': ["original"+str(i) for i in range(5)],
    'rating': [1, 2, 3, 4, 5],
    'tms': [1, 2, 3, 4, 5],
    'glk': [1, 2, 3, 4, 5],
}).set_index('name')

upd = pd.DataFrame({
    'name': ['file4','file5','file6','file7'],
    'pags': ["new"+str(i) for i in range(4)],
    'rating': [11, 12, 13, 14],
}).set_index('name')

df = df.combine(upd, lambda old,new: new.fillna(old), overwrite=False)[df.columns]

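Unless I am missing something, you do not need to call .astype() on every column; you can call it once on the whole DataFrame. Note that .fillna(0) replaces the missing values with 0, and errors="ignore" leaves columns that cannot be converted to int (such as pags) unchanged: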

df = (
    df.combine(upd, lambda old, new: new.fillna(old), overwrite=False)[df.columns]
    .fillna(0)
    .astype(int, errors="ignore")
)
print(df)
# Output
            pags  rating  tms  glk
name
file1  original0       1    1    1
file2  original1       2    2    2
file3  original2       3    3    3
file4       new0      11    4    4
file5       new1      12    5    5
file6       new2      13    0    0
file7       new3      14    0    0
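
If filling the gaps with 0 is not acceptable, one possible alternative is DataFrame.convert_dtypes(), which converts integer-valued float columns to the nullable Int64 dtype and keeps missing cells as <NA>. The sketch below assumes a pandas version that provides convert_dtypes (1.0 or later) and reuses the example data from the question; the names combined and result are illustrative. Be aware that it also converts the string column to a pandas string dtype, so the dtypes are integer again but not identical to the original NumPy ones.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['file1', 'file2', 'file3', 'file4', 'file5'],
    'pags': ["original" + str(i) for i in range(5)],
    'rating': [1, 2, 3, 4, 5],
    'tms': [1, 2, 3, 4, 5],
    'glk': [1, 2, 3, 4, 5],
}).set_index('name')

upd = pd.DataFrame({
    'name': ['file4', 'file5', 'file6', 'file7'],
    'pags': ["new" + str(i) for i in range(4)],
    'rating': [11, 12, 13, 14],
}).set_index('name')

# Same combine call as in the question.
combined = df.combine(upd, lambda old, new: new.fillna(old), overwrite=False)[df.columns]

# Let pandas pick nullable dtypes: integer-valued columns become Int64,
# and missing cells are kept as <NA> instead of being replaced by 0.
result = combined.convert_dtypes()
print(result.dtypes)
print(result)
```

Whether <NA> or 0 is the better placeholder depends on how the missing tms and glk values are used downstream.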

