
Python Pandas Parser Error when loading large csv file

I am learning about loading large csv files into Python via pandas. I am using Anaconda and Python 3 on a PC with 64 GB of RAM.

The Loan_Portfolio_Example_Large.csv dataset consists of 2509 columns and 100,000 rows and is approximately 1.4 GB.

I can run the following code without error:

import pandas as pd

MyList = []
Chunk_Size = 10000
for chunk in pd.read_csv('Loan_Portfolio_Example_Large.csv', chunksize=Chunk_Size):
    MyList.append(chunk)

However, when I use the Loan_Portfolio_Example_Large.csv file to create a larger file, namely Loan_Portfolio_Example_Larger.csv, the following code produces an error.

Note that all I did to create the larger file was copy the 100,000 rows from Loan_Portfolio_Example_Large.csv and paste them 4 more times (i.e., pasting into the rows below in Excel and saving as csv) to create a file of 500,000 rows and 2509 columns (this file is about 4.2 GB).
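As an aside, the copy-paste step can be done in Python instead of Excel, which avoids any chance of Excel re-quoting or mangling fields when saving a very large sheet. A minimal sketch, with a tiny stand-in file written first so the example runs on its own; point it at the real Loan_Portfolio_Example_Large.csv instead:

```python
# Stand-in for the real 100,000-row file so this example is runnable.
with open('Loan_Portfolio_Example_Large.csv', 'w', newline='') as f:
    f.write('col_a,col_b\n1,2\n3,4\n')

# Read the header once, then the data rows as a single blob.
with open('Loan_Portfolio_Example_Large.csv', 'r', newline='') as src:
    header = src.readline()   # keep the header line once
    body = src.read()         # all remaining data rows

# Write the header followed by 5 copies of the rows
# (the original 100,000 rows plus 4 pastes -> 500,000 rows).
with open('Loan_Portfolio_Example_Larger.csv', 'w', newline='') as dst:
    dst.write(header)
    for _ in range(5):
        dst.write(body)
```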

The following code produces a parser error, and I am unsure why: the data has only gotten larger, I haven't changed the structure of the csv file in any other way, I should have plenty of memory, and the increased chunk size shouldn't cause any issues.

Any thoughts? I wonder if the csv is somehow getting corrupted when it is saved (given it is so large).

import pandas as pd

MyList = []
Chunk_Size = 100000
for chunk in pd.read_csv('Loan_Portfolio_Example_Larger.csv', chunksize=Chunk_Size):
    MyList.append(chunk)
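One way to narrow this down is to scan the file with the stdlib csv module and report every row whose field count differs from the header's. This is a diagnostic sketch, not part of the original question; it writes a tiny stand-in file with one malformed row so it runs on its own — point it at the real Loan_Portfolio_Example_Larger.csv instead:

```python
import csv

# Stand-in file: row 3 has an extra field, mimicking the bad record.
with open('example.csv', 'w', newline='') as f:
    f.write('a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n')

def find_bad_rows(path):
    """Return (line_number, field_count) for rows that don't match the header."""
    bad = []
    with open(path, newline='') as f:
        reader = csv.reader(f)
        expected = len(next(reader))   # header sets the expected field count
        for lineno, row in enumerate(reader, start=2):
            if len(row) != expected:
                bad.append((lineno, len(row)))
    return bad

print(find_bad_rows('example.csv'))    # -> [(3, 4)]
```

Running this against the real file should point at line 145134 (and any other inconsistent rows) directly.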

Error output:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input> in <module>
      2 MyList=[]
      3 Chunk_Size = 100000
----> 4 for chunk in pd.read_csv('Loan_Portfolio_Example_Larger.csv', chunksize=Chunk_Size):
      5     MyList.append(chunk)
      6 print("--- %s seconds ---" % (time.time() - start_time))

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in __next__(self)
   1126     def __next__(self):
   1127         try:
-> 1128             return self.get_chunk()
   1129         except StopIteration:
   1130             self.close()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in get_chunk(self, size)
   1186                 raise StopIteration
   1187             size = min(size, self.nrows - self._currow)
-> 1188         return self.read(nrows=size)
   1189
   1190

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1152     def read(self, nrows=None):
   1153         nrows = _validate_integer("nrows", nrows)
-> 1154         ret = self._engine.read(nrows)
   1155
   1156         # May alter columns / col_dict

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   2057     def read(self, nrows=None):
   2058         try:
-> 2059             data = self._reader.read(nrows)
   2060         except StopIteration:
   2061             if self._first_chunk:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 2509 fields in line 145134, saw 3802

It seems like record 145134 has some delimiter characters in the data, making it look like it has more columns. Try using read_csv with the parameters below: it will tell you which records have issues but will not stop the process.

pd.read_csv('Loan_Portfolio_Example_Larger.csv',
            chunksize=Chunk_Size,
            error_bad_lines=False,
            warn_bad_lines=True)
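Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in 2.0; on current versions the equivalent is the on_bad_lines parameter. A minimal sketch, using an in-memory CSV with one malformed row standing in for the real file:

```python
import io
import pandas as pd

# In-memory stand-in for the real file; row 3 has an extra field.
data = io.StringIO('a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n')

# on_bad_lines='skip' drops malformed rows instead of raising ParserError;
# on_bad_lines='warn' would also emit a warning for each skipped row.
chunks = []
for chunk in pd.read_csv(data, chunksize=2, on_bad_lines='skip'):
    chunks.append(chunk)

df = pd.concat(chunks)
print(len(df))   # the malformed row is dropped, leaving 2 data rows
```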
