Pandas storing NaN value when adding new column to existing DataFrame

Question

I am reading .xls files that contain Brazilian population estimates since 2000. I start with the 2000.xls file, populating a DataFrame called main_df that at first looks like:

STATE    STATE_CODE    CITY       CITY_CODE      2000_POP
SP       X             Sao Paulo  Y              10.000.000
...

After iterating over the *.xls files from 2001 to 2020, main_df should look like:

STATE    STATE_CODE    CITY       CITY_CODE      2000_POP     2001_POP  2002_POP   ...  2020_POP
SP       X             Sao Paulo  Y              10.000.000   m         n          ...  p
...

To make this happen I'm using pandas in a not very efficient way, iterating over the DataFrame rows, but that was the only way I found to look up the population by city and state codes.

Here df is the DataFrame holding the city population estimates for one of the years 2001 to 2020.
This is the code snippet that iterates over every row of df, trying to populate main_df:

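import os
import pandas as pd

# read_excel has no encoding/sep parameters (those belong to read_csv)
df = pd.read_excel(filename)

# Use the file name without its extension as the new column name
column_year_id = os.path.splitext(filename)[0]
df.columns = ['STATE', 'STATE_CODE', 'CITY', 'CITY_CODE', column_year_id]

for index, row in df.iterrows():
    target_uf = row['STATE_CODE']
    target_city_code = str(row['CITY_CODE'])
    # The population is the last value in the row
    population_on_current_year = row.iloc[-1]

    # Boolean mask for the rows of main_df that match both codes
    selection = (main_df['STATE_CODE'] == target_uf) & (main_df['CITY_CODE'] == target_city_code)

    main_df.loc[selection, column_year_id] = population_on_current_year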

The problem is that main_df ends up with only its original 2000 population column filled; the columns from 2001 to 2020 contain only NaN values, like this:

STATE    STATE_CODE    CITY       CITY_CODE      2000_POP     2001_POP  2002_POP   ...  2020_POP
SP       X             Sao Paulo  Y              10.000.000   NaN       NaN        ...  NaN
...

Why is it happening and what should I do to make it work?
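
One possible cause worth ruling out is a dtype mismatch between the lookup value and the column. The question does not show the dtypes, so this is only an assumption: if CITY_CODE is stored as integers in main_df, comparing it with str(row['CITY_CODE']) matches no rows, and .loc then creates the new column without filling anything. A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up stand-in for main_df; CITY_CODE is ASSUMED to be stored as integers
main_df = pd.DataFrame({
    "STATE_CODE": [35],
    "CITY_CODE": [3550308],
    "2000_POP": [10_000_000],
})

# The loop in the question converts the code to a string before comparing
target_city_code = str(3550308)

# Comparing an integer column with a string selects no rows
selection = main_df["CITY_CODE"] == target_city_code
print(selection.any())

# .loc still creates the new column, but with no row selected it stays NaN
main_df.loc[selection, "2001"] = 10_100_000
print(main_df)
```

Checking main_df.dtypes and df.dtypes, and whether selection.any() is ever True inside the loop, would confirm or rule this out.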

I suspect the problem is that I cannot insert an element at a specific position, as if main_df were an array, using main_df[index, column]. Does pandas allow this kind of insertion?
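
For reference, pandas does support assigning to a single cell by row label and column label, through DataFrame.loc and the scalar accessor DataFrame.at. The sketch below uses made-up data; the values and the "2001_POP" column name are illustrative and not taken from the real files:

```python
import pandas as pd

# Made-up stand-in for main_df
main_df = pd.DataFrame({
    "STATE_CODE": [35, 33],
    "CITY_CODE": [3550308, 3304557],
    "2000_POP": [10_000_000, 5_800_000],
})

# .loc with a row label and a new column label creates the column
main_df.loc[0, "2001_POP"] = 10_100_000

# .at is the scalar accessor for a single (row label, column label) cell
main_df.at[1, "2001_POP"] = 5_850_000

print(main_df)
```

Both accessors take index labels, not positions; the positional counterparts are .iloc and .iat.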

Edit 1: This is how I create main_df :

main_df = pd.read_excel(filename)

Answer 1

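I managed to do what I wanted with:

# Boolean mask for the rows matching both codes
selection = (main_df['COD_UF'] == target_state) & (main_df['COD_MUN'] == target_city)
# Index labels of the matching rows
index = main_df.loc[selection].index
# Assign the value to the first matching row only
main_df.loc[index.values[0], column_year_id] = population_on_current_year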
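
As an alternative to the row-by-row loop, each yearly table can be joined onto main_df in one step with DataFrame.merge. This is only a sketch under assumptions: it uses the column names from the question (STATE_CODE, CITY_CODE) rather than COD_UF and COD_MUN, made-up population values, and it assumes the code columns may need to be cast to a common dtype before merging.

```python
import pandas as pd

# Made-up stand-in for main_df
main_df = pd.DataFrame({
    "STATE_CODE": [35, 33],
    "CITY_CODE": [3550308, 3304557],
    "2000_POP": [10_000_000, 5_800_000],
})

# Made-up stand-in for one yearly file, with columns already renamed;
# here the city codes are ASSUMED to have been read as strings
df = pd.DataFrame({
    "STATE_CODE": [33, 35],
    "CITY_CODE": ["3304557", "3550308"],
    "2001": [5_850_000, 10_100_000],
})

keys = ["STATE_CODE", "CITY_CODE"]

# Cast the key columns to one dtype so equal codes actually compare equal
df[keys] = df[keys].astype("int64")

# A left merge keeps every row of main_df and adds the year column
main_df = main_df.merge(df[keys + ["2001"]], on=keys, how="left")

print(main_df)
```

Cities missing from a yearly file end up with NaN in that year's column, which makes genuine gaps easy to tell apart from lookup failures.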

solution1
0 2020-10-26 10:26:06
