
df.loc to replace comma separated numbers in dataframes

I downloaded dataframes from here: https://ods.od.nih.gov/HealthInformation/Dietary_Reference_Intakes.aspx

using BeautifulSoup, but some of the numeric values contain a thousands separator and asterisks, both of which I want to remove. I have a regex to remove the asterisks, but for the commas I tried str.replace(",", "") and then writing the new string back with .loc. My code:

#iterate each df field and if comma sep, replace
for name,df in df_dict.items():
    print(name, df.dtypes)
    cols = list(df.columns)
    #print(cols)
    for idx, row in df.iterrows():
        # skip lifestage group col
        for i in range(1,len(cols)):
            curr_val = str(row[cols[i]])
            print(f'curr_val: {type(curr_val),curr_val}')
            print(f'row[0]:{row[cols[0]]}')
            if "," in curr_val:
                clean_val = curr_val.replace(",", "")
                print(f'comma: {df.loc[row[cols[0]], cols[i]]}')
                df.loc[row[cols[0]],cols[i]] = clean_val
                print(f'no comma: {df.loc[row[cols[0]], cols[i]]}\n')
            

The df.dtypes shows

Life-Stage Group     object
Calcium (mg/d)       object
Chromium (μg/d)      object
Copper (μg/d)        object
Fluoride (mg/d)      object
Iodine (μg/d)        object
Iron (mg/d)          object
Magnesium (mg/d)     object
Manganese (mg/d)     object
Molybdenum (μg/d)    object
Phosphorus (mg/d)    object
Selenium (μg/d)      object
Zinc (mg/d)          object
Potassium (mg/d)     object
Sodium (mg/d)        object
Chloride (g/d)       object
dtype: object

so I think it should work, but no changes actually occur.

Ideally I want to strip out both the commas and the "*" and keep just the int or float value.

@piterbarg's answer was correct. Edited to this and it works:

#iterate each df field and if comma sep, replace
for name,df in df_dict.items():
    str_df = df.copy().astype(str)
    cols = list(df.columns)
    print(f'cols[0]: {cols[0]}')
    
    # skip lifestage group col
    for i in range(1,len(cols)):
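        # regex=False treats ',' and '*' as literal characters on every pandas version
        str_df[cols[i]] = str_df[cols[i]].str.replace(',', '', regex=False).str.replace('*', '', regex=False)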


    df_dict[name] = str_df

Without access to your df it is hard to help you. See how to provide a great pandas example, as well as a minimal, complete, and verifiable example.

But a few things look suspicious in your code, specifically this: df.loc[row[cols[0]], cols[i]]. The .loc indexer takes a df index label as its first argument, so I would have thought this should be df.loc[idx, cols[i]] in a couple of places. I am a bit surprised it does not actually complain there.
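To illustrate the df.loc[idx, cols[i]] point, here is a minimal sketch of the row loop writing back through the index label that iterrows() yields. The sample values are made up for illustration (they are not taken from the NIH tables), and the frame is assumed to have a default integer index.

```python
import pandas as pd

# Hypothetical sample data standing in for one of the scraped tables.
df = pd.DataFrame({
    "Life-Stage Group": ["Infants 0-6 mo", "Children 1-3 y"],
    "Calcium (mg/d)": ["200*", "1,000"],
})
cols = list(df.columns)

for idx, row in df.iterrows():
    # Skip the first (life-stage group) column.
    for col in cols[1:]:
        curr_val = str(row[col])
        if "," in curr_val:
            # idx is the row's index label, which is what .loc expects.
            df.loc[idx, col] = curr_val.replace(",", "")

print(df)
```

The key difference from the loop in the question is that the row selector is idx rather than the value found in the first column.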

Also, you can do your replacements on whole columns in one go, along the lines of

# loop over columns i here
# regex=False so that '*' is treated as a literal character
df[cols[i]] = df[cols[i]].str.replace(',', '', regex=False).str.replace('*', '', regex=False)
df[cols[i]] = df[cols[i]].astype(float) # or int

This is generally much preferred to the iterrows() loop you have there.
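As a self-contained sketch of the column-wise approach, the example below strips commas and asterisks in a single regex pass and then converts each column to a numeric dtype. The sample values are invented for illustration, and using pd.to_numeric with errors="coerce" instead of astype(float) is my own substitution: it turns any text that is still non-numeric after stripping into NaN instead of raising.

```python
import pandas as pd

# Hypothetical sample data; the first column holds labels, the rest hold values.
df = pd.DataFrame({
    "Life-Stage Group": ["Infants 0-6 mo", "Children 1-3 y", "Males 9-13 y"],
    "Calcium (mg/d)": ["200*", "700", "1,300"],
    "Iron (mg/d)": ["0.27*", "7", "8"],
})

# Skip the life-stage group column and clean every value column.
for col in df.columns[1:]:
    # The character class [,*] matches a literal comma or a literal asterisk.
    stripped = df[col].astype(str).str.replace(r"[,*]", "", regex=True)
    # Convert to int or float; leftover non-numeric text becomes NaN.
    df[col] = pd.to_numeric(stripped, errors="coerce")

print(df.dtypes)
print(df)
```

Because each column is converted on its own, a column of whole numbers ends up as an integer dtype while a column containing decimals ends up as float.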
