
Pandas read_csv with usecols option fails to read

I'm trying to read a big TSV file using the pandas package. The TSV was extracted from a zip file, which contains the header names separately. The file is not written by me - I got it from an external source (it's clickstream data). I run this through a Jupyter notebook on an Amazon virtual instance.

My code is as follows:

df = pd.read_csv(zf.open(item[:-4]),
                 compression=None,
                 sep='\t',
                 parse_dates=True,
                 names=df_headers,
                 usecols=columns_to_include,
                 error_bad_lines=False)

df_headers is a list of 680 field names, which were provided in a separate TSV. My problem is that I get hundreds of errors of the type:

Skipping line 158548: expected 680 fields, saw 865

Skipping line 181906: expected 680 fields, saw 865

Skipping line 306190: expected 680 fields, saw 689

Skipping line 306191: expected 680 fields, saw 686

Skipping line 469427: expected 680 fields, saw 1191

Skipping line 604104: expected 680 fields, saw 865

and then the operation stops, with the following Traceback

raise ValueError('skip_footer not supported for iteration')

and then: pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7988)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:9261)()

pandas/parser.pyx in pandas.parser.TextReader._convert_column_data (pandas/parser.c:10190)()

CParserError: Too many columns specified: expected 680 and found 489

This is not the first file I've read this way - I've read a lot of files and usually got fewer than 10 such errors, which I could just ignore while still reading the files. I don't know why this time the number of problematic rows is so big, or why the reading stops. How can I proceed? I can't even open the TSV files because they are huge, and when I tried one of the tools that are supposed to handle big files, I couldn't find the lines from the errors - the row numbers didn't match the ones reported in the errors (i.e. I couldn't just go to row 158548 and see what the problem is there). Any help would be VERY appreciated! This is quite crucial for me.
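One way to locate the malformed rows without opening the whole file in an editor is to scan it line by line and count the tab-separated fields yourself. A minimal sketch (the expected count of 680 is taken from the question; the path is a placeholder, and pandas' reported line numbers can be offset from physical line numbers once earlier bad lines have been skipped, so counting yourself gives the real positions):

```python
import csv

def find_bad_lines(path, expected=680, sep="\t"):
    """Return (line_number, field_count) pairs for lines whose
    field count differs from the expected header width."""
    bad = []
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f, delimiter=sep)
        for lineno, row in enumerate(reader, start=1):
            if len(row) != expected:
                bad.append((lineno, len(row)))
    return bad
```

Printing the first few entries of the result shows exactly which physical lines are ragged and by how many fields, which also helps decide whether the extra fields come from stray tabs inside values.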

Edited: When I run read_csv without the usecols option (I tried it only on a subset of the big file), it succeeds. For some reason usecols prevents pandas from identifying the real columns. I updated pandas to version 0.19.2, as I saw that there were some bug fixes regarding the usecols option, but now I have a worse problem - when I read a subset of the file (using nrows=) I get different results with and without usecols. With usecols I get the following error:

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

and now I don't even know in which line...

If I run it without usecols I manage to read, BUT only for a subset of the data (200000 out of ~700000 lines). When I try to read 200000 rows at a time and then append the resulting DataFrames, I get a memory error.
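Since the file parses cleanly when usecols is left out, one workaround is to read in chunks with all columns, subset each chunk to the wanted ~100 columns immediately, and concatenate only the slimmed-down pieces, so the full 680-column frame never sits in memory at once. A sketch, assuming df_headers and columns_to_include are defined as in the question:

```python
import pandas as pd

def read_selected_columns(path_or_buf, df_headers, columns_to_include,
                          chunksize=200_000):
    """Read a wide TSV in chunks, keeping only the wanted columns."""
    pieces = []
    reader = pd.read_csv(path_or_buf, sep="\t", names=df_headers,
                         chunksize=chunksize)
    for chunk in reader:
        # subset before accumulating, so memory holds ~100 columns
        # per chunk instead of all 680
        pieces.append(chunk[columns_to_include])
    return pd.concat(pieces, ignore_index=True)
```

If the ragged lines still need skipping, error_bad_lines=False (renamed to on_bad_lines='skip' in pandas 1.3+) can be added to the read_csv call.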

The number of usecols columns is around 100, out of almost 700 columns overall. I have dozens of such files, each with around 700000 lines.

Answering a specific case: when you load a DataFrame in pandas without a header (labeled DataFrames of pass/fail occurrences, etc.), and the file has a large number of columns with some of them empty, an issue will occur during parsing (Error: too many columns specified...).

For this case/purpose try to use:

df = pd.read_csv('file.csv', header=None, low_memory=False)

low_memory=False lets you load all of this data, even with completely empty columns in sequence.

Notes:

  • assuming pandas is imported as pd

  • assuming your files are in the same directory as your Jupyter notebook

  • laptop with 16 GB RAM + i5 vPro (2 cores)
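As a small illustration of the answer above (using an inline 4-column sample instead of a real file): with header=None, pandas assigns integer column labels, so the first data row is not promoted to a header, and a fully empty column simply comes back as NaN.

```python
import io

import pandas as pd

# sample data with an entirely empty third column
sample = io.StringIO("1,2,,4\n5,6,,8\n")
df = pd.read_csv(sample, header=None, low_memory=False)

print(df.columns.tolist())  # integer labels: [0, 1, 2, 3]
print(df[2].isna().all())   # the empty third column is all NaN: True
```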

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.

© 2020-2024 STACKOOM.COM