简体   繁体   中英

Pandas read_csv with incorrect header delimiter

I have a csv file, say test.csv and looks like this

Data 1 | Data 2 | Data 3 |
1      | 2      | 3      |      |      |    6

If I open up the csv in notepad++ or text, it shows up like this

Data 1, Data 2, Data 3
1,2,3,,,6

instead of

Data 1, Data 2, Data 3, ,,
1,2,3,,,6

I get an error using pandas.read_csv because my rows have more columns than declared in my header row. Are there ways to find the maximum number of columns of the entire CSV?

My proposition is to:

  • skip the title row,
  • specify your own list of column names.

So the code should be:

df = pd.read_csv(io.StringIO(txt), skiprows=1,
    names=['Data 1', 'Data 2', 'Data 3', 'Data 4', 'Data 5', 'Data 6'])

A detail to consider is how should be represented empty cells.

The default behaviour of read_csv is to treat them as NaN , but the downside is then that such columns are coerced to float .

Another possibility is to pass na_filter=False parameter and then:

  • such empty strings are represented "as is" (as an empty string),
  • but columns with these values are this time coerced to object , because string cells are intermixed with int .

Note also that the format of your title row is corrupted in one more detail. The source of problem are spaces after commas: They become the initial part of the next column name.

So even if you had just 3 columns of data, the column names would be:

['Data 1', ' Data 2', ' Data 3']

(note the leading spaces). So this is yet another reason to skip the title row.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM