Pandas CParserError: Error tokenizing data
I have a large csv file with 25 columns that I want to read as a pandas dataframe. I am using pandas.read_csv(). The problem is that some rows have extra columns, something like that:
col1 col2 stringColumn ... col25
1 12 1 str1 3
...
33657 2 3 str4 6 4 3 #<- that line has a problem
33658 1 32 blbla #<-some columns have missing data too
When I try to read it, I get the error:
CParserError: Error tokenizing data. C error: Expected 25 fields in line 33657, saw 28
The problem does not happen if the extra values appear in the first rows. For example, if I add values to the third row of the same file, it works fine:
#that example works:
col1 col2 stringColumn ... col25
1 12 1 str1 3
2 12 1 str1 3
3 12 1 str1 3 f 4
...
33657 2 3 str4 6 4 3 #<- that line has a problem
33658 1 32 blbla #<-some columns have missing data too
My guess is that pandas checks the first (n) rows to determine the number of columns, and if you have extra columns after that it has a problem parsing it.
Skipping the offending lines as suggested here is not an option; those lines contain valuable information.
Does anybody know a way around this?
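A note for readers on newer pandas: the error_bad_lines option discussed in the answers was deprecated in pandas 1.3 in favour of on_bad_lines, and from pandas 1.4 (with engine='python') on_bad_lines also accepts a callable, which makes it possible to salvage over-long rows instead of dropping them. A minimal sketch with made-up three-column data (the folding-into-the-last-column strategy is just one illustrative choice):

```python
import io

import pandas as pd

# Made-up data: the header defines 3 columns, but one row carries extras.
data = "a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n"

# Fold any overflow fields into the last column so nothing is lost.
def salvage(bad_line):
    return bad_line[:2] + ["|".join(bad_line[2:])]

# The callable form of on_bad_lines requires pandas >= 1.4 and engine='python'.
df = pd.read_csv(io.StringIO(data), engine="python", on_bad_lines=salvage)
print(df)
```

The bad row is kept as `4,5,6|7|8`, so the extra values survive and can be post-processed later, at the cost of the affected column becoming an object dtype.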
Since I did not find an answer that completely solves the problem, here is my workaround: I found out that explicitly passing the column names with the option names=('col1', 'col2', 'stringColumn' ... 'column25', '', '', '') allows me to read the file. It forces me to read and parse every column, which is not ideal since I only need about half of them, but at least I can read the file now. Combining the arguments names and usecols does not work; if somebody has another solution I would be happy to hear it.
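To illustrate the names= workaround on a toy file (the column names and data below are made up; the real file has 25 columns with rows up to 28 fields wide): pad the name list with dummy columns so the widest row fits, then drop the padding once the file is loaded.

```python
import io

import pandas as pd

# Made-up data: most rows have 3 fields, one has 5.
data = "1,12,str1\n2,3,str4,6,4\n1,32,blbla\n"

# Pad the name list with dummy columns so the widest row still fits.
names = ["col1", "col2", "stringColumn", "extra1", "extra2"]
df = pd.read_csv(io.StringIO(data), names=names)

# Short rows get NaN in the padding columns; drop the padding afterwards.
df = df.drop(columns=["extra1", "extra2"])
print(df.shape)  # (3, 3)
```

The padding columns can also be inspected before dropping them, in case the overflow values matter.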
In my initial post I mentioned not using "error_bad_lines" = False in pandas.read_csv. I decided that actually doing so is the more proper and elegant solution. I found this post quite useful:
Can I redirect the stdout in python into some sort of string buffer?
I added a little twist to the code shown in the answer.
import sys
import re
from io import StringIO  # cStringIO on Python 2

import pandas as pd

fake_csv = '''1,2,3\na,b,c\na,b,c\na,b,c,d,e\na,b,c\na,b,c,d,e\na,b,c\n'''  # bad data
fname = "fake.csv"

# Capture the "Skipping line ..." warnings that pandas prints to stderr.
# Note: in pandas >= 1.3 error_bad_lines was replaced by on_bad_lines.
old_stderr = sys.stderr
sys.stderr = mystderr = StringIO()
df1 = pd.read_csv(StringIO(fake_csv), error_bad_lines=False)
sys.stderr = old_stderr

# Pull the numbers (line number, expected fields, actual fields) out of each warning.
log = mystderr.getvalue()
isnum = re.compile(r"\d+")
lines_skipped_log = [
    isnum.findall(i) + [fname]
    for i in log.split("\n") if isnum.search(i)
]
columns = ["line_num", "flds_expct", "num_fields", "file"]
lines_skipped_log.insert(0, columns)
From there you can do anything you want with lines_skipped_log, such as output to csv, create a dataframe, etc.
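For instance, the header-plus-rows list built above converts straight into a dataframe or a csv (the log rows below are made-up stand-ins in the same shape as the snippet produces):

```python
import io

import pandas as pd

# Hypothetical log in the shape built above: a header row followed by
# [line_num, flds_expct, num_fields, file] entries.
lines_skipped_log = [
    ["line_num", "flds_expct", "num_fields", "file"],
    ["4", "3", "5", "fake.csv"],
    ["6", "3", "5", "fake.csv"],
]

# First element is the header, the rest are the data rows.
skipped = pd.DataFrame(lines_skipped_log[1:], columns=lines_skipped_log[0])

# Write the log out as csv (here to an in-memory buffer for illustration).
buf = io.StringIO()
skipped.to_csv(buf, index=False)
print(buf.getvalue())
```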
Perhaps you have a directory full of files. You can create a list of pandas data frames out of each log and concatenate them. From there you will have a log of what rows were skipped, and for which files, at your fingertips (literally!).
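The multi-file case is a straightforward pd.concat over the per-file logs (the two small logs below are fabricated; in practice each would come from running the capture snippet on one file, with the filename already in the last column):

```python
import pandas as pd

# Hypothetical per-file logs, each already carrying its filename.
columns = ["line_num", "flds_expct", "num_fields", "file"]
logs = [
    pd.DataFrame([["4", "3", "5", "a.csv"]], columns=columns),
    pd.DataFrame([["7", "3", "6", "b.csv"]], columns=columns),
]

# One combined record of every skipped row across all files.
all_skipped = pd.concat(logs, ignore_index=True)
print(all_skipped["file"].tolist())  # ['a.csv', 'b.csv']
```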
A possible workaround is to specify the column names. Please refer to my answer to a similar issue: https://stackoverflow.com/a/43145539/6466550