Pandas read_csv with incorrect header delimiter

Question

I have a CSV file, say test.csv, that looks like this:

Data 1 | Data 2 | Data 3 |
1      | 2      | 3      |      |      |    6

If I open the CSV in Notepad++ or a plain text editor, it shows up like this

Data 1, Data 2, Data 3
1,2,3,,,6

instead of

Data 1, Data 2, Data 3, ,,
1,2,3,,,6

I get an error using pandas.read_csv because my rows have more columns than declared in my header row. Are there ways to find the maximum number of columns of the entire CSV?

Answer 1

My suggestion is to:

skip the title row,
specify your own list of column names.

So the code should be (txt holds the content of the file as a string):

import io
import pandas as pd

df = pd.read_csv(io.StringIO(txt), skiprows=1,
    names=['Data 1', 'Data 2', 'Data 3', 'Data 4', 'Data 5', 'Data 6'])
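The six names above are typed by hand. If the width of the file is not known in advance, a first pass with the standard csv module can count the fields in every row, which is what the question asks for. This is only a sketch: the sample text mirrors the file from the question, and the generated names ('Data 1' to 'Data N') are an assumed naming scheme, not something the file provides.

```python
import csv
import io

import pandas as pd

# Sample content mirroring test.csv from the question.
txt = "Data 1, Data 2, Data 3\n1,2,3,,,6\n"

# First pass: the widest row (header included) gives the column count.
max_cols = max(len(row) for row in csv.reader(io.StringIO(txt)))

# Assumed naming scheme: 'Data 1' ... 'Data N'.
names = [f"Data {i}" for i in range(1, max_cols + 1)]

# Second pass: skip the short title row and use the generated names.
df = pd.read_csv(io.StringIO(txt), skiprows=1, names=names)
print(max_cols)  # 6
print(df.columns.tolist())
```

For a file on disk, the same idea should work with open('test.csv', newline='') in place of io.StringIO(txt).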

A detail to consider is how empty cells should be represented.

The default behaviour of read_csv is to treat them as NaN, but the downside is that such columns are then coerced to float.
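A small sketch of that default behaviour may help. The second data row is made up for illustration (the question shows only one row), so that 'Data 4' contains both an empty cell and a number.

```python
import io

import pandas as pd

# The second data row (4,5,6,7,,8) is a hypothetical addition.
txt = "Data 1, Data 2, Data 3\n1,2,3,,,6\n4,5,6,7,,8\n"
names = ['Data 1', 'Data 2', 'Data 3', 'Data 4', 'Data 5', 'Data 6']

df = pd.read_csv(io.StringIO(txt), skiprows=1, names=names)

# Columns without empty cells stay integer; 'Data 4' and 'Data 5'
# contain NaN and are therefore float.
print(df.dtypes)
print(df['Data 4'].tolist())
```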

Another possibility is to pass the na_filter=False parameter, and then:

such empty cells are represented "as is" (as an empty string),
but columns with these values are no longer numeric: an empty string cannot be converted to a number, so the column is read as text (object dtype).
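The effect of na_filter=False can be checked with a short sketch. The second data row is again a made-up addition, used so that one column mixes an empty cell with a digit. In the pandas versions I have tried, every value in such a column comes back as a string, including the digits.

```python
import io

import pandas as pd

# The second data row (4,5,6,7,,8) is a hypothetical addition.
txt = "Data 1, Data 2, Data 3\n1,2,3,,,6\n4,5,6,7,,8\n"
names = ['Data 1', 'Data 2', 'Data 3', 'Data 4', 'Data 5', 'Data 6']

df = pd.read_csv(io.StringIO(txt), skiprows=1, names=names,
                 na_filter=False)

# Empty cells are kept as '' instead of NaN.
print(repr(df.loc[0, 'Data 4']))
# Column dtypes: fully numeric columns stay integer, the others hold text.
print(df.dtypes)
```

If numeric columns are needed afterwards, pd.to_numeric(df['Data 4'], errors='coerce') would bring the NaN representation back for that column.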

Note also that your title row is malformed in one more way: the spaces after the commas become the leading part of the next column name.

So even if you had just 3 columns of data, the column names would be:

['Data 1', ' Data 2', ' Data 3']

(note the leading spaces). So this is yet another reason to skip the title row.
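The leading spaces are easy to reproduce on a well-formed three-column sample (made up here for illustration). The sketch also shows read_csv's skipinitialspace parameter, which is not part of the suggestion above but strips the spaces that follow a delimiter when the title row is otherwise usable.

```python
import io

import pandas as pd

# Hypothetical three-column sample with spaces after the commas in the header.
txt = "Data 1, Data 2, Data 3\n1,2,3\n"

# Default parsing keeps the space as part of the column name.
raw = pd.read_csv(io.StringIO(txt))
print(raw.columns.tolist())    # ['Data 1', ' Data 2', ' Data 3']

# skipinitialspace=True drops spaces that follow a delimiter.
clean = pd.read_csv(io.StringIO(txt), skipinitialspace=True)
print(clean.columns.tolist())  # ['Data 1', 'Data 2', 'Data 3']
```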
