I'm trying to read a big tsv file using the pandas package. The tsv was extracted from a zip file, which contains the header names separately. The file is not written by me - I got it from an external source (it's clickstream data). I run this through a Jupyter notebook on an Amazon virtual instance.
My code is as follows:
df = pd.read_csv(zf.open(item[:-4]),
                 compression=None,
                 sep='\t',
                 parse_dates=True,
                 names=df_headers,
                 usecols=columns_to_include,
                 error_bad_lines=False)
df_headers is a list of 680 field names which were provided in a separate tsv. My problem is that I get hundreds of errors of the type:
Skipping line 158548: expected 680 fields, saw 865
Skipping line 181906: expected 680 fields, saw 865
Skipping line 306190: expected 680 fields, saw 689
Skipping line 306191: expected 680 fields, saw 686
Skipping line 469427: expected 680 fields, saw 1191
Skipping line 604104: expected 680 fields, saw 865
and then the operation stops with the following traceback:
raise ValueError('skip_footer not supported for iteration')
and then: pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7988)()
pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:9261)()
pandas/parser.pyx in pandas.parser.TextReader._convert_column_data (pandas/parser.c:10190)()
CParserError: Too many columns specified: expected 680 and found 489
This is not the first file I'm reading in this way - I have read a lot of files and usually got fewer than 10 such errors, which I could just ignore while still reading the files. I don't know why this time the number of problematic rows is so big, or why the reading stops. How can I proceed? I can't even open the tsv files because they are huge, and when I tried one of the tools that are supposed to handle big files, I couldn't find the lines from the errors - the row numbers did not match the ones reported in the error messages (i.e. I couldn't just go to row 158548 and see what the problem is there). Any help would be VERY appreciated! This is quite crucial for me.
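One way to locate the malformed rows outside of pandas is to count the fields per line yourself. This is a minimal sketch, shown on a small in-memory sample rather than the real file; for the actual data you would iterate over `io.TextIOWrapper(zf.open(item[:-4]))` instead of `sample` (those names are taken from the question's code, and the expected field count would be 680):

```python
import io

# Count tab-separated fields on each line and report lines that deviate
# from the expected column count. `sample` is placeholder data; swap in
# the real file handle for the clickstream tsv.
sample = io.StringIO("a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n8\t9\t0\n")
expected_fields = 3  # 680 for the files in the question

bad_lines = []
for lineno, line in enumerate(sample, start=1):
    n = len(line.rstrip("\n").split("\t"))
    if n != expected_fields:
        bad_lines.append((lineno, n))
        print(f"line {lineno}: expected {expected_fields} fields, saw {n}")
```

Note that the line numbers pandas reports count parsed data rows, so they can be offset from the raw file's line numbers - counting lines directly like this avoids that mismatch.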
Edited: When I run read_csv without the usecols option (I tried it only on a subset of the big file), it succeeds. For some reason usecols causes a problem for pandas in identifying the real columns. I updated pandas to version 0.19.2, as I saw that there were some bug fixes regarding the usecols option, but now I have a worse problem - when I run the read on a subset of the file (using nrows=) I get different results with and without usecols. With usecols I get the following error:
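Since the read succeeds without usecols, one possible workaround is to parse all columns and select the wanted subset afterwards. This is only a sketch of that idea: `df_headers` and `columns_to_include` are the lists from the question, shown here with small placeholder values and in-memory data:

```python
import io
import pandas as pd

# Placeholder stand-ins for the question's 680 header names and ~100
# selected columns.
df_headers = ["a", "b", "c"]
columns_to_include = ["a", "c"]
data = io.StringIO("1\t2\t3\n4\t5\t6\n")

# Read everything, then subset after parsing instead of passing usecols.
df = pd.read_csv(data, sep="\t", names=df_headers)
df = df[columns_to_include]
print(df.shape)  # (2, 2)
```

This sidesteps the usecols code path in the parser at the cost of temporarily holding all columns in memory.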
CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
and now I don't even know on which line...
If I run it without usecols I manage to read, BUT only for a subset of the data (200,000 out of ~700,000 lines) - when I try to read 200,000 rows at a time and then append the resulting DataFrames, I get a memory error.
The number of usecols columns is around 100, and the overall number of columns is almost 700. I have dozens of such files, each with around 700,000 lines.
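For the memory error, a chunked read may help: parse the file in pieces, keep only the ~100 wanted columns per chunk, and concatenate at the end, so the unused ~600 columns are never all in memory at once. A hedged sketch, again with placeholder names and data standing in for the real tsv:

```python
import io
import pandas as pd

# Placeholder stand-ins for the real header list and column subset.
df_headers = ["a", "b", "c"]
columns_to_include = ["a", "c"]
data = io.StringIO("1\t2\t3\n4\t5\t6\n7\t8\t9\n")

chunks = []
# chunksize yields the file in pieces; a real value might be 100_000 rows.
for chunk in pd.read_csv(data, sep="\t", names=df_headers, chunksize=2):
    # Drop the unneeded columns immediately to keep peak memory low.
    chunks.append(chunk[columns_to_include])

df = pd.concat(chunks, ignore_index=True)
print(df.shape)  # (3, 2)
```

Dropping columns per chunk before concatenating is what keeps the peak footprint small compared to appending full-width DataFrames.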
Answering a specific case: when you load a dataframe in pandas without a header (labeled dataframes of pass/fail occurrences, etc.), the file has a large number of columns, and some columns are empty, an issue can occur during parsing (Error: too many columns specified...).
For this case, try using:
df = pd.read_csv('file.csv', header=None, low_memory=False)
Setting low_memory=False lets you load all of this data, even with completely empty columns in sequence.
Notes:
assuming pandas is imported as pd
assuming your files are in the same directory as your Jupyter notebook
tested on a laptop with 16GB RAM and an i5 vPro 2-core CPU