
Pandas read_csv - running with usecols option fails to read

I'm trying to read a big TSV file using the pandas package. The TSV was extracted from a zip file, which contains the header names separately. It is not written by me - I got this file from an external source (it's clickstream data). I run this through a Jupyter notebook, on an Amazon virtual instance.

My code is as follows:

# zf is an open zipfile.ZipFile; item is the member name (item[:-4] strips '.zip')
df = pd.read_csv(zf.open(item[:-4]),
                 compression=None,
                 sep='\t',
                 parse_dates=True,
                 names=df_headers,            # the 680 names from the separate header TSV
                 usecols=columns_to_include,
                 error_bad_lines=False)       # renamed on_bad_lines='skip' in pandas >= 1.3
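
For context, a minimal sketch of the setup the snippet above assumes - the archive, member, and header-file names here are hypothetical stand-ins, not from the original post:

import zipfile

import pandas as pd

zf = zipfile.ZipFile('clickstream_export.zip')      # hypothetical archive name
item = 'clicks.tsv.zip'                             # member name; item[:-4] -> 'clicks.tsv'
# Hypothetical: one header name per line in the separate TSV
df_headers = pd.read_csv('column_headers.tsv',
                         sep='\t', header=None)[0].tolist()
columns_to_include = df_headers[:100]               # the ~100 columns actually needed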

df_headers are 680 fields which were provided in a separate TSV. My problem is that I get hundreds of errors of the type:

Skipping line 158548: expected 680 fields, saw 865

Skipping line 181906: expected 680 fields, saw 865

Skipping line 306190: expected 680 fields, saw 689

Skipping line 306191: expected 680 fields, saw 686

Skipping line 469427: expected 680 fields, saw 1191

Skipping line 604104: expected 680 fields, saw 865

and then the operation stops, with the following traceback:

raise ValueError('skip_footer not supported for iteration')

and then:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7988)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:9261)()

pandas/parser.pyx in pandas.parser.TextReader._convert_column_data (pandas/parser.c:10190)()

CParserError: Too many columns specified: expected 680 and found 489

This is not the first file I'm reading this way - I have read a lot of files and usually got fewer than 10 such errors, which I could just ignore. I don't know why this time the number of problematic rows is so big and why the reading stops. How can I proceed? I can't even open the TSVs because they are huge, and when I tried one of the tools that are supposed to be able to open big files, I couldn't find the lines from the errors, as the row numbers didn't match the ones reported (i.e. I couldn't just go to row 158548 and see what the problem is there...). Any help would be VERY appreciated! This is quite crucial for me.
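
One way to inspect the reported rows without opening the whole file: stream it line by line and print only a small window around the offending line number. A minimal sketch, where the path 'clicks.tsv' is a hypothetical stand-in; since names= is passed and the file has no header row, the line numbers pandas reports should correspond closely to physical lines:

from itertools import islice

target = 158548  # a line number reported by pandas
with open('clicks.tsv', encoding='utf-8') as f:
    # islice positions are 0-based, so this covers 1-based lines target-1 .. target+1
    for lineno, raw in enumerate(islice(f, target - 2, target + 1), start=target - 1):
        print(lineno, raw.count('\t') + 1, 'fields')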

Edited: When I run read_csv without the usecols option (I tried it only on a subset of the big file), it succeeds. For some reason usecols causes a problem for pandas in identifying the real columns... I updated the pandas version to 0.19.2, as I saw that there were some bug fixes regarding the usecols option, but now I have a worse problem: when I run the read on a subset of the file (using nrows=) I get different results with and without usecols. With usecols I get the following error:

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

and now I don't even know on which line...

If I run it without usecols I manage to read, BUT only for a subset of the data (200000 out of ~700000 lines). When I try to read 200000 rows at a time and then append the resulting DataFrames, I get a memory error (see the sketch after the next paragraph)...

The number of usecols columns is around 100, and the number of overall columns is almost 700. I have dozens of such files, where each one has around 700000 lines.
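
A sketch of one possible workaround (not from the original post): parse all columns so the tokenizer tolerates the ragged rows, but iterate in chunks and keep only the ~100 wanted columns of each chunk, so the full ~700-column frame never sits in memory at once. 'clicks.tsv', df_headers, and columns_to_include stand in for the asker's values:

import pandas as pd

chunks = pd.read_csv('clicks.tsv',
                     sep='\t',
                     names=df_headers,          # all ~700 column names
                     error_bad_lines=False,     # on_bad_lines='skip' in pandas >= 1.3
                     chunksize=200000)
# Subset each chunk before concatenating, so memory holds ~100 columns, not ~700
df = pd.concat((chunk[columns_to_include] for chunk in chunks), ignore_index=True)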

Answering a specific case: when you load a dataframe in pandas without a header (labeled dataframes of pass/fail occurrences, etc.), the file has a large number of columns, and some of them are empty, an issue will occur during the process (Error: too many columns specified...).

For this case/purpose, try to use:

df = pd.read_csv('file.csv', header=None, low_memory=False)

low_memory=False lets you load all of this data even with completely empty columns, including several of them in sequence.
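
A hypothetical follow-up under the same assumptions: since header=None yields integer column labels, the names from the separate header TSV can be attached after the load, and the wanted columns selected then instead of via usecols:

df = pd.read_csv('file.csv', header=None, low_memory=False)
# Attach the externally supplied names (assuming the counts match),
# then keep only the wanted subset
df.columns = df_headers
df = df[columns_to_include]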

Notes:

  • assuming pandas is imported as pd

  • assuming your files are in the same directory as your Jupyter notebook

  • tested on a laptop with 16GB RAM + a 2-core i5 vPro
