
Pandas read_csv - running with usecols option fails to read

I'm trying to read a big TSV file using the pandas package. The TSV was extracted from a zip file, which contains the header names separately. It is not written by me - I got this file from an external source (it's clickstream data). I run this through a Jupyter notebook, on an Amazon virtual instance.

My code is as follows:

# zf is an open zipfile.ZipFile; item is the member name (item[:-4] strips '.zip')
df = pd.read_csv(zf.open(item[:-4]),
                 compression=None,
                 sep='\t',
                 parse_dates=True,
                 names=df_headers,            # the 680 names from the separate header TSV
                 usecols=columns_to_include,
                 error_bad_lines=False)       # renamed on_bad_lines='skip' in pandas >= 1.3
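
For context, a minimal sketch of the setup the snippet above assumes - the archive, member, and header-file names here are hypothetical stand-ins, not from the original post:

import zipfile

import pandas as pd

zf = zipfile.ZipFile('clickstream_export.zip')      # hypothetical archive name
item = 'clicks.tsv.zip'                             # member name; item[:-4] -> 'clicks.tsv'
# Hypothetical: one header name per line in the separate TSV
df_headers = pd.read_csv('column_headers.tsv',
                         sep='\t', header=None)[0].tolist()
columns_to_include = df_headers[:100]               # the ~100 columns actually needed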

df_headers are 680 fields which were provided in a separate TSV. My problem is that I get hundreds of errors of the type:

Skipping line 158548: expected 680 fields, saw 865

Skipping line 181906: expected 680 fields, saw 865

Skipping line 306190: expected 680 fields, saw 689

Skipping line 306191: expected 680 fields, saw 686

Skipping line 469427: expected 680 fields, saw 1191

Skipping line 604104: expected 680 fields, saw 865

and then the operation stops, with the following traceback:

raise ValueError('skip_footer not supported for iteration')

and then:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7988)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:9261)()

pandas/parser.pyx in pandas.parser.TextReader._convert_column_data (pandas/parser.c:10190)()

CParserError: Too many columns specified: expected 680 and found 489

This is not the first file I'm reading this way - I have read a lot of files and usually got fewer than 10 such errors, which I could just ignore. I don't know why this time the number of problematic rows is so big and why the reading stops. How can I proceed? I can't even open the TSVs because they are huge, and when I tried one of the tools that are supposed to be able to open big files, I couldn't find the lines from the errors, as the row numbers didn't match the ones reported (i.e. I couldn't just go to row 158548 and see what the problem is there...). Any help would be VERY appreciated! This is quite crucial for me.
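
One way to inspect the reported rows without opening the whole file: stream it line by line and print only a small window around the offending line number. A minimal sketch, where the path 'clicks.tsv' is a hypothetical stand-in; since names= is passed and the file has no header row, the line numbers pandas reports should correspond closely to physical lines:

from itertools import islice

target = 158548  # a line number reported by pandas
with open('clicks.tsv', encoding='utf-8') as f:
    # islice positions are 0-based, so this covers 1-based lines target-1 .. target+1
    for lineno, raw in enumerate(islice(f, target - 2, target + 1), start=target - 1):
        print(lineno, raw.count('\t') + 1, 'fields')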

Edited: When I run read_csv without the usecols option (I tried it only on a subset of the big file), it succeeds. For some reason usecols causes a problem for pandas in identifying the real columns... I updated the pandas version to 0.19.2, as I saw that there were some bug fixes regarding the usecols option, but now I have a worse problem: when I run the read on a subset of the file (using nrows=) I get different results with and without usecols. With usecols I get the following error:

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

and now I don't even know on which line...

If I run it without usecols I manage to read, BUT only for a subset of the data (200000 out of ~700000 lines). When I try to read 200000 rows at a time and then append the resulting DataFrames, I get a memory error (see the sketch after the next paragraph)...

The number of usecols columns is around 100, and the number of overall columns is almost 700. I have dozens of such files, where each one has around 700000 lines.
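
A sketch of one possible workaround (not from the original post): parse all columns so the tokenizer tolerates the ragged rows, but iterate in chunks and keep only the ~100 wanted columns of each chunk, so the full ~700-column frame never sits in memory at once. 'clicks.tsv', df_headers, and columns_to_include stand in for the asker's values:

import pandas as pd

chunks = pd.read_csv('clicks.tsv',
                     sep='\t',
                     names=df_headers,          # all ~700 column names
                     error_bad_lines=False,     # on_bad_lines='skip' in pandas >= 1.3
                     chunksize=200000)
# Subset each chunk before concatenating, so memory holds ~100 columns, not ~700
df = pd.concat((chunk[columns_to_include] for chunk in chunks), ignore_index=True)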

Answering a specific case: when you load a dataframe in pandas without a header (labeled dataframes of pass/fail occurrences, etc.), the file has a large number of columns, and some of them are empty, an issue will occur during the process (Error: too many columns specified...).

For this case/purpose, try to use:

df = pd.read_csv('file.csv', header=None, low_memory=False)

low_memory=False lets you load all of this data even with completely empty columns, including several of them in sequence.
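
A hypothetical follow-up under the same assumptions: since header=None yields integer column labels, the names from the separate header TSV can be attached after the load, and the wanted columns selected then instead of via usecols:

df = pd.read_csv('file.csv', header=None, low_memory=False)
# Attach the externally supplied names (assuming the counts match),
# then keep only the wanted subset
df.columns = df_headers
df = df[columns_to_include]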

Notes:

  • assuming pandas is imported as pd

  • assuming your files are in the same directory as your Jupyter notebook

  • tested on a laptop with 16GB RAM + a 2-core i5 vPro
