Replace incorrectly formatted values in a dataframe

Question

I am importing an Excel spreadsheet as a dataframe using pandas. The spreadsheet is manually maintained and contains several data entry errors, the most common of which is integers formatted as strings with leading non-breaking spaces ('\xa0'). The spreadsheet is updated regularly, so where and when these inconsistencies pop up is unpredictable.

Basically, I am trying to find a clean way to find and re-format these values. As they are mainly restricted to one column, I have tried several versions of this:

for row in df.loc[:, 'col']:
    if type(row) == str:
        row = row.replace(u'\xa0', u'')

If I add a print(row) call inside the for loop, it prints exactly what I expect, i.e., ' 1187383' becomes '1187383'. However, outside of the for loop, the value is not replaced. Once the loop has executed, calling .loc returns the unchanged entry (' 1187383').

I'm sure I'm missing something obvious here, but I've now invested about a day trying to find the solution. Any help is appreciated! And please let me know if you need more information.

Answer 1

I'd recommend trying Bharath Shetty's suggestion, but with a slight improvement:

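# Keep only digits and the decimal point, then convert to numbers.
# regex=True is needed explicitly on pandas 2.0+, where str.replace is literal by default.
s = df['col'].astype(str).str.replace(r'[^0-9.]', '', regex=True)
df['col'] = pd.to_numeric(s, errors='coerce')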
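
For illustration, here is a minimal sketch of why the loop in the question has no effect and how the vectorized version behaves. The sample values are made up, not taken from the asker's spreadsheet. Assigning to the loop variable only rebinds a local name, so the column is never touched; the fix is to build a cleaned Series and assign it back to the column.

```python
import pandas as pd

# Hypothetical sample data: a clean integer, a string with a leading
# non-breaking space, a string with a regular space, and a junk entry.
df = pd.DataFrame({'col': [1187383, '\xa01187384', ' 42', 'n/a']})

# Rebinding the loop variable changes only the local name `value`,
# not the data stored in the column.
for value in df['col']:
    if isinstance(value, str):
        value = value.replace('\xa0', '')
after_loop = df.loc[1, 'col']
print(repr(after_loop))  # the non-breaking space is still there

# Vectorized cleanup: strip everything except digits and '.',
# convert to numbers, and assign the result back to the column.
cleaned = df['col'].astype(str).str.replace(r'[^0-9.]', '', regex=True)
df['col'] = pd.to_numeric(cleaned, errors='coerce')
print(df['col'].tolist())
```

Entries that contain no digits come out as NaN, so the column ends up as floats. Note that the pattern also strips minus signs, so it only suits columns that should never hold negative numbers.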

solution1
2 ACCEPTED 2017-10-04 14:17:15
