
Pandas CParserError: Error tokenizing data

I have a large csv file with 25 columns that I want to read as a pandas dataframe. I am using pandas.read_csv() . The problem is that some rows have extra columns, something like this:

        col1   col2   stringColumn   ...   col25
1        12      1       str1                 3
...
33657    2       3       str4                 6       4    3 #<- that line has a problem
33658    1      32       blbla                 #<-some columns have missing data too 

When I try to read it, I get the error

CParserError: Error tokenizing data. C error: Expected 25 fields in line 33657, saw 28

The problem does not happen if the extra values appear in the first rows. For example, if I add values to the third row of the same file, it works fine:

#that example works: 
           col1   col2   stringColumn   ...   col25
    1        12      1       str1                 3
    2        12      1       str1                 3
    3        12      1       str1                 3       f    4
    ...
    33657    2       3       str4                 6       4    3 #<- that line has a problem
    33658    1      32       blbla                 #<-some columns have missing data too 

My guess is that pandas checks the first n rows to determine the number of columns, and if there are extra columns after that, it has a problem parsing them.

Skipping the offending lines as suggested here is not an option; those lines contain valuable information.

Does anybody know a way around this?

Since I did not find an answer that completely solves the problem, here is my workaround: I found that explicitly passing the column names with the option names=('col1', 'col2', 'stringColumn' ... 'column25', '', '', '') allows me to read the file. It forces me to read and parse every column, which is not ideal since I only need about half of them, but at least I can read the file now. Combining the arguments names and usecols does not work; if somebody has another solution I would be happy to hear it.
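
A minimal sketch of that workaround, assuming a comma-separated file called data.csv and three dummy padding columns to absorb the extra fields (both the file name and the amount of padding are illustrative assumptions):

import pandas as pd

# 25 real column names plus a few dummy names so the widest rows still fit
names = ['col%d' % i for i in range(1, 26)] + ['extra1', 'extra2', 'extra3']

df = pd.read_csv('data.csv', header=0, names=names)

# afterwards, keep only the columns actually needed
df = df[['col1', 'col2', 'col3']]  # illustrative subset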

In my initial post I mentioned not using "error_bad_lines" = False in pandas.read_csv. I decided that actually doing so is the more proper and elegant solution. I found this post quite useful:

Can I redirect the stdout in python into some sort of string buffer?

I added a little twist to the code shown in the answer.

import sys
import re
from io import StringIO  # use cStringIO.StringIO on Python 2
import pandas as pd

# sample data: two rows have 5 fields instead of the expected 3
fake_csv = '''1,2,3\na,b,c\na,b,c\na,b,c,d,e\na,b,c\na,b,c,d,e\na,b,c\n''' #bad data
fname = "fake.csv"

# capture stderr, where the C parser reports each skipped line
old_stderr = sys.stderr
sys.stderr = mystderr = StringIO()

df1 = pd.read_csv(StringIO(fake_csv),
                  error_bad_lines=False)

sys.stderr = old_stderr
log = mystderr.getvalue()

# each warning line carries three numbers: the line number,
# the expected field count, and the field count actually seen
isnum = re.compile(r"\d+")

lines_skipped_log = [
    isnum.findall(i) + [fname]
    for i in log.split("\n") if isnum.search(i)
]

columns = ["line_num", "flds_expct", "num_fields", "file"]
lines_skipped_log.insert(0, columns)

From there you can do anything you want with lines_skipped_log, such as output to csv, create a dataframe, etc.
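
For example, a minimal sketch that turns the captured log into a dataframe and writes it out (the output file name is illustrative):

skipped_df = pd.DataFrame(lines_skipped_log[1:], columns=lines_skipped_log[0])
skipped_df.to_csv("skipped_lines.csv", index=False)  # illustrative file name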

Perhaps you have a directory full of files. You can create a list of pandas data frames out of each log and concatenate them. From there you will have a log of which rows were skipped and for which files at your fingertips (literally!).
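
A sketch of that idea, reusing the names from the block above (sys, StringIO, pd, isnum, columns); the directory pattern and the read_with_log helper are illustrative, not part of any library:

import glob

def read_with_log(path):
    # read one CSV, capturing the skipped-line warnings from stderr
    old_stderr = sys.stderr
    sys.stderr = captured = StringIO()
    try:
        df = pd.read_csv(path, error_bad_lines=False)
    finally:
        sys.stderr = old_stderr
    log_rows = [isnum.findall(i) + [path]
                for i in captured.getvalue().split("\n") if isnum.search(i)]
    return df, log_rows

frames, all_logs = [], []
for path in glob.glob("data_dir/*.csv"):  # illustrative directory
    df, log_rows = read_with_log(path)
    frames.append(df)
    all_logs.extend(log_rows)

full_log = pd.DataFrame(all_logs, columns=columns)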

A possible workaround is to specify the column names. Please refer to my answer to a similar issue: https://stackoverflow.com/a/43145539/6466550
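
A hedged sketch of that approach: pre-scan the file for the widest row, then hand read_csv enough generated names to cover it. The file name is an assumption, and the naive split assumes a plain comma delimiter with no quoted fields containing commas:

import pandas as pd

fname = 'data.csv'  # illustrative file name
with open(fname) as f:
    max_fields = max(len(line.split(',')) for line in f)

df = pd.read_csv(fname, header=0,
                 names=['col%d' % i for i in range(max_fields)])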

