I'm trying to read a big tsv file using the pandas package. The tsv was extracted from a zip file, which contains the header names separately. The file is not written by me - I got it from an external source (it's clickstream data). I run this through a Jupyter notebook on an Amazon virtual instance.
My code is as follows:
df = pd.read_csv(zf.open(item[:-4]),
                 compression=None,
                 sep='\t',
                 parse_dates=True,
                 names=df_headers,
                 usecols=columns_to_include,
                 error_bad_lines=False)
df_headers is a list of 680 field names which were provided in a separate tsv. My problem is that I get hundreds of errors of the type:
Skipping line 158548: expected 680 fields, saw 865
Skipping line 181906: expected 680 fields, saw 865
Skipping line 306190: expected 680 fields, saw 689
Skipping line 306191: expected 680 fields, saw 686
Skipping line 469427: expected 680 fields, saw 1191
Skipping line 604104: expected 680 fields, saw 865
and then the operation stops with the following traceback:
raise ValueError('skip_footer not supported for iteration')
and then: pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7988)()
pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:9261)()
pandas/parser.pyx in pandas.parser.TextReader._convert_column_data (pandas/parser.c:10190)()
CParserError: Too many columns specified: expected 680 and found 489
This is not the first file I'm reading in this way - I have read a lot of files and usually got fewer than 10 such errors, which I could just ignore while still reading the files. I don't know why this time the number of problematic rows is so big, or why the reading stops. How can I proceed? I can't even open the tsv files because they are huge, and when I tried one of the tools that are supposed to handle big files, I couldn't find the lines from the errors - the row numbers did not match the ones reported in the error messages (i.e. I couldn't just go to row 158548 and see what the problem is there). Any help would be VERY appreciated! This is quite crucial for me.
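One way to locate the malformed rows outside of pandas is to count the fields per line yourself. This is a minimal sketch, shown on a small in-memory sample rather than the real file; for the actual data you would iterate over `io.TextIOWrapper(zf.open(item[:-4]))` instead of `sample` (those names are taken from the question's code, and the expected field count would be 680):

```python
import io

# Count tab-separated fields on each line and report lines that deviate
# from the expected column count. `sample` is placeholder data; swap in
# the real file handle for the clickstream tsv.
sample = io.StringIO("a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n8\t9\t0\n")
expected_fields = 3  # 680 for the files in the question

bad_lines = []
for lineno, line in enumerate(sample, start=1):
    n = len(line.rstrip("\n").split("\t"))
    if n != expected_fields:
        bad_lines.append((lineno, n))
        print(f"line {lineno}: expected {expected_fields} fields, saw {n}")
```

Note that the line numbers pandas reports count parsed data rows, so they can be offset from the raw file's line numbers - counting lines directly like this avoids that mismatch.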
Edited: When I run read_csv without the usecols option (I tried it only on a subset of the big file), it succeeds. For some reason usecols causes a problem for pandas in identifying the real columns. I updated pandas to version 0.19.2, as I saw that there were some bug fixes regarding the usecols option, but now I have a worse problem - when I run the read on a subset of the file (using nrows=) I get different results with and without usecols. With usecols I get the following error:
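Since the read succeeds without usecols, one possible workaround is to parse all columns and select the wanted subset afterwards. This is only a sketch of that idea: `df_headers` and `columns_to_include` are the lists from the question, shown here with small placeholder values and in-memory data:

```python
import io
import pandas as pd

# Placeholder stand-ins for the question's 680 header names and ~100
# selected columns.
df_headers = ["a", "b", "c"]
columns_to_include = ["a", "c"]
data = io.StringIO("1\t2\t3\n4\t5\t6\n")

# Read everything, then subset after parsing instead of passing usecols.
df = pd.read_csv(data, sep="\t", names=df_headers)
df = df[columns_to_include]
print(df.shape)  # (2, 2)
```

This sidesteps the usecols code path in the parser at the cost of temporarily holding all columns in memory.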
CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
and now I don't even know on which line...
If I run it without usecols I manage to read, BUT only for a subset of the data (200,000 out of ~700,000 lines) - when I try to read 200,000 rows at a time and then append the resulting DataFrames, I get a memory error.
The number of usecols columns is around 100, and the overall number of columns is almost 700. I have dozens of such files, each with around 700,000 lines.
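For the memory error, a chunked read may help: parse the file in pieces, keep only the ~100 wanted columns per chunk, and concatenate at the end, so the unused ~600 columns are never all in memory at once. A hedged sketch, again with placeholder names and data standing in for the real tsv:

```python
import io
import pandas as pd

# Placeholder stand-ins for the real header list and column subset.
df_headers = ["a", "b", "c"]
columns_to_include = ["a", "c"]
data = io.StringIO("1\t2\t3\n4\t5\t6\n7\t8\t9\n")

chunks = []
# chunksize yields the file in pieces; a real value might be 100_000 rows.
for chunk in pd.read_csv(data, sep="\t", names=df_headers, chunksize=2):
    # Drop the unneeded columns immediately to keep peak memory low.
    chunks.append(chunk[columns_to_include])

df = pd.concat(chunks, ignore_index=True)
print(df.shape)  # (3, 2)
```

Dropping columns per chunk before concatenating is what keeps the peak footprint small compared to appending full-width DataFrames.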
Answering a specific case: when you load a dataframe in pandas without a header (labeled dataframes of pass/fail occurrences, etc.), the file has a large number of columns, and some columns are empty, an issue can occur during parsing (Error: too many columns specified...).
For this case, try using:
df = pd.read_csv('file.csv', header=None, low_memory=False)
Setting low_memory=False lets you load all of this data, even with completely empty columns in sequence.
Notes:
assuming pandas is imported as pd
assuming your files are in the same directory as your Jupyter notebook
tested on a laptop with 16GB RAM and an i5 vPro 2-core CPU