
Python Pandas Parser Error when loading large csv file

I am learning about loading large csv files into Python via pandas. I am using Anaconda and Python 3 on a PC with 64 GB of RAM.

The Loan_Portfolio_Example_Large.csv dataset consists of 2509 columns and 100,000 rows and is approximately 1.4 GB.

I can run the following code without error:

import pandas as pd

MyList = []
Chunk_Size = 10000
for chunk in pd.read_csv('Loan_Portfolio_Example_Large.csv', chunksize=Chunk_Size):
    MyList.append(chunk)

However, when I use the Loan_Portfolio_Example_Large.csv file to create a larger file, namely Loan_Portfolio_Example_Larger.csv, the following code produces an error.

Note that all I did to create the larger file was copy the 100,000 rows from Loan_Portfolio_Example_Large.csv and paste them 4 more times (i.e., pasting into the rows below in Excel and saving as csv) to create a file of 500,000 rows and 2509 columns (this file is about 4.2 GB).
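As an aside, the copy-paste step can be done in Python instead of Excel, which avoids any chance of Excel re-quoting or mangling fields when saving a very large sheet. A minimal sketch, with a tiny stand-in file written first so the example runs on its own; point it at the real Loan_Portfolio_Example_Large.csv instead:

```python
# Stand-in for the real 100,000-row file so this example is runnable.
with open('Loan_Portfolio_Example_Large.csv', 'w', newline='') as f:
    f.write('col_a,col_b\n1,2\n3,4\n')

# Read the header once, then the data rows as a single blob.
with open('Loan_Portfolio_Example_Large.csv', 'r', newline='') as src:
    header = src.readline()   # keep the header line once
    body = src.read()         # all remaining data rows

# Write the header followed by 5 copies of the rows
# (the original 100,000 rows plus 4 pastes -> 500,000 rows).
with open('Loan_Portfolio_Example_Larger.csv', 'w', newline='') as dst:
    dst.write(header)
    for _ in range(5):
        dst.write(body)
```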

The following code produces a parser error, and I am unsure why: the data has only gotten larger, I haven't changed the structure of the csv file in any other way, I should have plenty of memory, and the increased chunk size shouldn't cause any issues.

Any thoughts? I wonder if the csv is somehow getting corrupted when it is saved (given it is so large).

import pandas as pd

MyList = []
Chunk_Size = 100000
for chunk in pd.read_csv('Loan_Portfolio_Example_Larger.csv', chunksize=Chunk_Size):
    MyList.append(chunk)
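One way to narrow this down is to scan the file with the stdlib csv module and report every row whose field count differs from the header's. This is a diagnostic sketch, not part of the original question; it writes a tiny stand-in file with one malformed row so it runs on its own — point it at the real Loan_Portfolio_Example_Larger.csv instead:

```python
import csv

# Stand-in file: row 3 has an extra field, mimicking the bad record.
with open('example.csv', 'w', newline='') as f:
    f.write('a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n')

def find_bad_rows(path):
    """Return (line_number, field_count) for rows that don't match the header."""
    bad = []
    with open(path, newline='') as f:
        reader = csv.reader(f)
        expected = len(next(reader))   # header sets the expected field count
        for lineno, row in enumerate(reader, start=2):
            if len(row) != expected:
                bad.append((lineno, len(row)))
    return bad

print(find_bad_rows('example.csv'))    # -> [(3, 4)]
```

Running this against the real file should point at line 145134 (and any other inconsistent rows) directly.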

Error output:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input> in <module>
      2 MyList=[]
      3 Chunk_Size = 100000
----> 4 for chunk in pd.read_csv('Loan_Portfolio_Example_Larger.csv', chunksize=Chunk_Size):
      5     MyList.append(chunk)
      6 print("--- %s seconds ---" % (time.time() - start_time))

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in __next__(self)
   1126     def __next__(self):
   1127         try:
-> 1128             return self.get_chunk()
   1129         except StopIteration:
   1130             self.close()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in get_chunk(self, size)
   1186                 raise StopIteration
   1187             size = min(size, self.nrows - self._currow)
-> 1188         return self.read(nrows=size)
   1189
   1190

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1152     def read(self, nrows=None):
   1153         nrows = _validate_integer("nrows", nrows)
-> 1154         ret = self._engine.read(nrows)
   1155
   1156         # May alter columns / col_dict

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   2057     def read(self, nrows=None):
   2058         try:
-> 2059             data = self._reader.read(nrows)
   2060         except StopIteration:
   2061             if self._first_chunk:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 2509 fields in line 145134, saw 3802

It seems like record 145134 has some delimiter characters in the data, making it look like it has more columns. Try using read_csv with the parameters below: it will tell you which records have issues but will not stop the process.

pd.read_csv('Loan_Portfolio_Example_Larger.csv',
            chunksize=Chunk_Size,
            error_bad_lines=False,
            warn_bad_lines=True)
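Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in 2.0; on current versions the equivalent is the on_bad_lines parameter. A minimal sketch, using an in-memory CSV with one malformed row standing in for the real file:

```python
import io
import pandas as pd

# In-memory stand-in for the real file; row 3 has an extra field.
data = io.StringIO('a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n')

# on_bad_lines='skip' drops malformed rows instead of raising ParserError;
# on_bad_lines='warn' would also emit a warning for each skipped row.
chunks = []
for chunk in pd.read_csv(data, chunksize=2, on_bad_lines='skip'):
    chunks.append(chunk)

df = pd.concat(chunks)
print(len(df))   # the malformed row is dropped, leaving 2 data rows
```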
