I am working with a massive CSV file (>3 million rows, 76 columns) and have decided to use Dask to read the data before converting it to a pandas DataFrame.
However, I am running into what looks like column bleeding in the last column. See the code and the error below.
import dask.dataframe as dd
import pandas as pd
dataframe = dd.read_csv("SAS url",
                        delimiter=",",
                        encoding="UTF-8",
                        blocksize=25e6,
                        engine='python')
Then, to check that all the columns are present, I use
dataframe.columns
When using dataframe.compute(), I see a parser error about the number of fields. When using the read_csv parameter error_bad_lines=False, it shows that many of the rows have 77 or 78 fields instead of the expected 76.
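That is, adding the flag to the same call, roughly like this (error_bad_lines was deprecated in pandas 1.3 in favor of on_bad_lines, so this assumes an older pandas):

# Same read as above, but report bad lines instead of raising.
dataframe = dd.read_csv("SAS url",
                        delimiter=",",
                        encoding="UTF-8",
                        blocksize=25e6,
                        engine='python',
                        error_bad_lines=False)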
Note: Omitting these faulty rows is unfortunately not an option.
Is there a way to keep all the fields and append these extra fields to new columns when necessary?
Yes, there is. You can use the names= parameter to add extra columns before you read the full CSV. I have not tried this with Dask, but Dask's read_csv calls pandas' read_csv under the covers, so this should be applicable to dd.read_csv as well (a Dask sketch is at the end of this answer).
To demonstrate using a simulated CSV file:
import io
import pandas as pd

sim_csv = io.StringIO(
'''A,B,C
11,21,31
12,22,32
13,23,33,43,53
14,24,34
15,25,35'''
)
By default, read_csv fails:
df = pd.read_csv(sim_csv)
ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 5
Capture the column names:
sim_csv.seek(0) # Not needed for a real CSV file
df = pd.read_csv(sim_csv, nrows=1)
save_cols = df.columns.to_list()
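For a large real file, you can capture the header without parsing any data rows; a minimal sketch, assuming a hypothetical path your_file.csv:

# nrows=0 reads only the header line, so no data rows are parsed.
save_cols = pd.read_csv("your_file.csv", nrows=0).columns.to_list()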
Add a couple of column names to the end of the names list and read your CSV:
sim_csv.seek(0) # Not needed for a real CSV file
df = pd.read_csv(sim_csv, skiprows=1, names=save_cols+['D','E'])
df
A B C D E
0 11 21 31 NaN NaN
1 12 22 32 NaN NaN
2 13 23 33 43.0 53.0
3 14 24 34 NaN NaN
4 15 25 35 NaN NaN
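Applied back to the original Dask read, the same approach would look roughly like the sketch below. This is untested with Dask, as noted above; "SAS url" is the question's placeholder, and the column names extra_1/extra_2 are hypothetical:

import dask.dataframe as dd
import pandas as pd

# Capture the 76 expected column names from the header only.
save_cols = pd.read_csv("SAS url", nrows=0).columns.to_list()

# Skip the header row and supply 78 names so rows with 77 or 78
# fields fit; shorter rows get NaN in the extra columns.
dataframe = dd.read_csv("SAS url",
                        delimiter=",",
                        encoding="UTF-8",
                        blocksize=25e6,
                        skiprows=1,
                        names=save_cols + ['extra_1', 'extra_2'])

df = dataframe.compute()  # materialize as a pandas DataFrame

If some rows carry even more than two extra fields, extend the list of added names accordingly.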