Dealing with Parse Errors when reading in csv via dask.dataframe

I am working with a massive CSV file (more than 3 million rows, 76 columns) and have decided to use Dask to read the data before converting it to a pandas DataFrame.

However, I am running into an issue of what looks like column bleeding in the last column. See the code and error below.

import dask.dataframe as dd
import pandas as pd


dataframe = dd.read_csv("SAS url",
                        delimiter=",",
                        encoding="UTF-8",
                        blocksize=25e6,
                        engine="python")


Then, to check that all the columns are present, I use:

dataframe.columns

When using


dataframe.compute()

I see the following error:

[Image: ParseError]

When I pass the read_csv parameter error_bad_lines=False, the output shows that many of the rows have 77 or 78 fields instead of the expected 76.

Note: Omitting these faulty rows is unfortunately not an option.

Solution I am seeking

Is there a way to keep all the fields and append these extra fields to new columns when necessary?

Yes, there is. You can use the names= parameter to add extra columns before you read the full CSV. I have not tried this with Dask, but Dask's read_csv calls pandas' read_csv under the hood, so this should apply to dd.read_csv as well.

To demonstrate using a simulated CSV file:

import io
import pandas as pd

# In-memory stand-in for a CSV file; one row has two extra fields
sim_csv = io.StringIO(
'''A,B,C
11,21,31
12,22,32
13,23,33,43,53
14,24,34
15,25,35'''
)

By default, read_csv fails:

df = pd.read_csv(sim_csv)

ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 5

Capture the column names:

sim_csv.seek(0)    # Not needed for a real CSV file
df = pd.read_csv(sim_csv, nrows=1)

save_cols = df.columns.to_list()

Add a couple of extra column names to the end of the names list, skip the original header row, and read your CSV:

sim_csv.seek(0)    # Not needed for a real CSV file
df = pd.read_csv(sim_csv, skiprows=1, names=save_cols + ['D', 'E'])

df

    A   B   C     D     E
0  11  21  31   NaN   NaN
1  12  22  32   NaN   NaN
2  13  23  33  43.0  53.0
3  14  24  34   NaN   NaN
4  15  25  35   NaN   NaN
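
The example above hardcodes two extra names because the widest row is known to have five fields. For a real file where the overflow width is not known in advance, one possible approach is to scan the file once with the standard-library csv module and build the padded names list from the widest row found. This is only a minimal sketch under that assumption: the helper name padded_column_names and the "extra_" prefix are made up for illustration and are not part of pandas or Dask.

```python
import csv
import io


def padded_column_names(handle, extra_prefix="extra_"):
    """Return the header names plus one placeholder name per overflow field.

    Hypothetical helper for illustration; `handle` is any open text file
    object positioned at the start of the CSV.
    """
    reader = csv.reader(handle)
    header = next(reader)
    # Widest data row in the file; fall back to the header width if there
    # are no data rows or none of them is wider than the header.
    widest = max((len(row) for row in reader), default=len(header))
    widest = max(widest, len(header))
    n_extra = widest - len(header)
    return header + [f"{extra_prefix}{i}" for i in range(1, n_extra + 1)]


# Small simulated file with one over-long row
sim_csv = io.StringIO(
    "A,B,C\n"
    "11,21,31\n"
    "13,23,33,43,53\n"
    "14,24,34\n"
)
names = padded_column_names(sim_csv)
print(names)  # ['A', 'B', 'C', 'extra_1', 'extra_2']
```

The resulting list could then be passed as names= together with skiprows=1, in the same way as save_cols + ['D', 'E'] above. For a file on disk, opening it with newline="" (as the csv module documentation recommends) keeps quoted line breaks intact. The extra pass reads the whole file once, which is the price of not guessing the number of overflow columns.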
