
Pandas read_csv - Ignore Escape Char in Semicolon Separated File

I am trying to load a semicolon-separated txt file, and there are a few instances where escape characters appear in the data. These are typically &lt ; (a space is added so it isn't converted to <), which introduces an extra semicolon. This obviously messes up my data and, since dtypes are important, causes read_csv problems. Is there a way to tell pandas to ignore these when the file is read?

I tried deleting the character from the file manually, and that works, but given that I want an automated process running on millions of rows, this is not sustainable.

import pandas as pd

df = pd.read_csv('file_loc.csv',
                 header=None,
                 names=column_names,
                 usecols=counters,
                 dtype=dtypes,
                 delimiter=';',
                 low_memory=False)
ValueError: could not convert string to float:

My first column is a string and the second is a float, but when the first column is split at the &lt ; the string spills over into the second column.

Is there a way to tell pandas to ignore these, or to remove them efficiently before loading?

Given the following example csv file so57732330.csv:

col1;col2
1&lt;2;a
3;

we read it using StringIO after unescaping named and numeric HTML5 character references:

import pandas as pd
import io
import html

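# Read the raw file, decode the HTML character references (e.g. &lt; -> <),
# then hand the cleaned text to read_csv via an in-memory buffer.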
with open('so57732330.csv') as f:
    s = f.read()
f = io.StringIO(html.unescape(s))
df = pd.read_csv(f, sep=';')

Result:

  col1 col2
0  1<2    a
1    3  NaN
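
Since the question mentions millions of rows, reading the whole file into a single string before unescaping may be wasteful. Below is a minimal sketch of the same idea applied line by line: write a cleaned copy of the file, then reuse the read_csv arguments from the question. The file names here are hypothetical, and column_names, counters and dtypes are assumed to be the variables already defined in the question.

import html
import pandas as pd

# Unescape HTML character references one line at a time, so the whole
# file never has to be held in memory at once.
with open('file_loc.csv', encoding='utf-8') as src, \
     open('file_loc_clean.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(html.unescape(line))

# Read the cleaned copy with the original arguments from the question.
df = pd.read_csv('file_loc_clean.csv',
                 header=None,
                 names=column_names,
                 usecols=counters,
                 dtype=dtypes,
                 delimiter=';',
                 low_memory=False)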

