
Filling in blanks on a DataFrame with numbered variables - Python Pandas

I have a DataFrame of the format:

ID    Theme    Operation    Volume
100  Jungle       S3         Full
200  Desert       S3         Full
302  Cavern       S1         Empty
303  Swamp        nan        Full
400  Jungle       S3          nan
600  Desert       nan        Empty

I would like to write a script that goes through the empty cells and replaces each 'nan' with a label NA_, where the _ is a running count of the missing values found so far. So my desired output would be:

ID    Theme    Operation    Volume
100  Jungle       S3         Full
200  Desert       S3         Full
302  Cavern       S1         Empty
303  Swamp        NA1        Full
400  Jungle       S3          NA3
600  Desert       NA2        Empty

When I try to iterate over the df and identify the nan values, for some reason the following did not work.

count = 0
for col in df.colums:
    for row in df[col]:
        if row == float('nan'):
            row = 'NA{}'.format(count)
            count += 1

Any ideas why? Or is there a better way to do this that I'm struggling to see?

Thanks:)
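
A hedged note on why the loop in the question has no effect, assuming the sample frame shown above (rebuilt here by hand): NaN never compares equal to anything, not even itself, so `row == float('nan')` is always False. Assigning to the loop variable `row` only rebinds a local name and never writes into the DataFrame, and `df.colums` is a typo for `df.columns`. A minimal corrected loop, as a sketch rather than the only way to do it:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (values assumed from the table above).
df = pd.DataFrame({
    "ID": [100, 200, 302, 303, 400, 600],
    "Theme": ["Jungle", "Desert", "Cavern", "Swamp", "Jungle", "Desert"],
    "Operation": ["S3", "S3", "S1", np.nan, "S3", np.nan],
    "Volume": ["Full", "Full", "Empty", "Full", np.nan, "Empty"],
})

# NaN is not equal to itself, so an equality test can never find it.
nan_equals_itself = float("nan") == float("nan")  # False

count = 0
for col in df.columns:
    # Index labels of the missing cells in this column, found with isna().
    for idx in df.index[df[col].isna()]:
        count += 1
        # Write back through .loc; rebinding a loop variable would not touch df.
        df.loc[idx, col] = f"NA{count}"

print(df)
```

Because the outer loop walks the columns in order, the missing values in Operation are numbered before those in Volume, which matches the desired output.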

Melt the columns into a single long column, replace each NaN with NA_ (where _ is the running count num), and pivot back to the original columns. Finally, assign the modified columns back to your original DataFrame:

tmp = df.reset_index().melt(id_vars='index', value_vars=['Operation', 'Volume'])
num = (tmp['value'].isna().cumsum()).astype(int)
tmp['value'] = tmp['value'].fillna('NA' + num.astype(str))
tmp = tmp.pivot(index='index', columns='variable', values='value')
df[tmp.columns] = tmp
>>> df
    ID   Theme Operation Volume
0  100  Jungle        S3   Full
1  200  Desert        S3   Full
2  302  Cavern        S1  Empty
3  303   Swamp       NA1   Full
4  400  Jungle        S3    NA3
5  600  Desert       NA2  Empty

This is a little tricky, but not impossible.

The important part is to sort by column first and then by index, so that the cumulative count of NA values runs down each column in turn. In other words, you don't want the NA values in Volume to be counted before those in Operation.

s = df.stack(dropna=False).reset_index()

s['level_1'] = pd.Categorical(s['level_1'],categories=df.columns.tolist())

s1 = s.sort_values(by=['level_1','level_0']).set_index(['level_0','level_1']
                 ).isna().cumsum().unstack(1).droplevel(0,1)

df = df.fillna('NA_' + s1.astype(str))

    ID   Theme Operation Volume
0  100  Jungle        S3   Full
1  200  Desert        S3   Full
2  302  Cavern        S1  Empty
3  303   Swamp      NA_1   Full
4  400  Jungle        S3   NA_3
5  600  Desert      NA_2  Empty
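
The same column-first numbering can also be sketched with NumPy's column-major ("F") ordering instead of stack and sort. This is an alternative illustration, not the method from the answers above; the sample frame is rebuilt by hand from the question and the names `mask`, `counts` and `labels` are made up for the example:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (values assumed from the table above).
df = pd.DataFrame({
    "ID": [100, 200, 302, 303, 400, 600],
    "Theme": ["Jungle", "Desert", "Cavern", "Swamp", "Jungle", "Desert"],
    "Operation": ["S3", "S3", "S1", np.nan, "S3", np.nan],
    "Volume": ["Full", "Full", "Empty", "Full", np.nan, "Empty"],
})

# Boolean array marking the missing cells.
mask = df.isna().to_numpy()

# Flatten column by column, take the running count of missing cells,
# then restore the original shape in the same column-major order.
counts = mask.ravel(order="F").cumsum().reshape(mask.shape, order="F")

# One label per cell; only the labels sitting on missing cells get used.
labels = "NA" + pd.DataFrame(counts, index=df.index, columns=df.columns).astype(str)

result = df.fillna(labels)
print(result)
```

Flattening in "F" order visits every row of Operation before any row of Volume, so the counts follow the column hierarchy described above without an explicit sort.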


 