
Replace incorrectly formatted values in a dataframe

I am importing an Excel spreadsheet as a dataframe using pandas. The spreadsheet is manually maintained and contains several data entry errors, the most common of which is integers stored as strings with leading non-breaking spaces ('\xa0'). The spreadsheet is updated regularly, so where and when these inconsistencies appear is unpredictable.

I am looking for a clean way to find and re-format these values. As they are mainly restricted to one column, I have tried several versions of this:

for row in df.loc[:, 'col']:
    if type(row) == str:
        row = row.replace(u'\xa0', u'')

If I add a print(row) call inside the for loop, it prints exactly what I expect, i.e., ' 1187383' becomes '1187383'. However, the value is not replaced in the dataframe itself: once the loop has finished, calling .loc returns the unchanged entry (' 1187383').

I'm sure I'm missing something obvious here, but I've now spent about a day trying to find the solution. Any help is appreciated! Please let me know if you need more information.

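I'd recommend trying Bharath Shetty's suggestion, but with a slight improvement: strip every character that is not a digit or a decimal point, then convert the result with pd.to_numeric: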

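# regex=True is required on recent pandas, where str.replace defaults to literal matching
s = df['col'].astype(str).str.replace(r'[^0-9.]', '', regex=True)
df['col'] = pd.to_numeric(s, errors='coerce')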
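
As for why the original loop has no effect: assigning to the loop variable only rebinds a local name to a new string; it never writes anything back into the dataframe. The sketch below shows one way to assign the cleaned values back to the column. It is a minimal example, assuming made-up sample data in place of the real spreadsheet, and it removes only non-breaking spaces and surrounding whitespace rather than every non-digit character:

```python
import pandas as pd

# Hypothetical sample data standing in for the imported spreadsheet column.
df = pd.DataFrame({'col': ['\xa01187383', 42, ' 77 ', None, 'n/a']})

# Build a cleaned Series: only string entries are touched, everything else
# (ints, missing values) is passed through unchanged.
cleaned = df['col'].map(
    lambda v: v.replace('\xa0', '').strip() if isinstance(v, str) else v
)

# Assign the result back to the column; entries that still cannot be parsed
# as numbers become NaN because of errors='coerce'.
df['col'] = pd.to_numeric(cleaned, errors='coerce')
print(df['col'].tolist())
```

Compared with the regex version, this keeps minus signs intact and turns an entry like 'n/a' into NaN instead of silently extracting whatever digits it happens to contain. Which behaviour is preferable depends on the kinds of errors that show up in the spreadsheet.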

