
Strategy to open a corrupt CSV file in pandas

I have a bunch of CSV files that load into pandas just fine, but one file is acting up. I'm opening it this way:

df = pd.DataFrame.from_csv(csv_file)

Error:

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py", line 1268, in from_csv
    encoding=encoding,tupleize_cols=False)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 198, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 479, in __init__
    self._make_engine(self.engine)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 586, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 957, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "parser.pyx", line 477, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4434)
  File "parser.pyx", line 599, in pandas.parser.TextReader._get_header (pandas/parser.c:5831)
pandas.parser.CParserError: Passed header=0 but only 0 lines in file

To me, this means there is some sort of corruption in the file. At a quick glance it seems fine, but it is a big file, and visually checking every single line is not an option. What would be a good strategy to troubleshoot a CSV file that pandas won't open?

Thank you.

It looks like pandas treats line 0 as the header. Try calling:

df = pd.DataFrame.from_csv(csv_file,header=None)

or

df = pd.read_csv(csv_file, header=None)

However, it's strange that the file seems to have zero lines (i.e. it's empty). Maybe the file path is wrong?
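To test the "wrong path" theory before involving pandas at all, something like the sketch below might help. It uses only the standard library; the function name summarize_file and the keys of the returned dict are made up for illustration.

```python
import os


def summarize_file(path):
    """Report where a path actually resolves and how much data it holds."""
    full_path = os.path.abspath(path)  # shows what a relative path really points at
    if not os.path.isfile(full_path):
        # Covers both a missing path and a directory.
        return {"path": full_path, "is_file": False, "bytes": 0, "lines": 0}
    with open(full_path, "rb") as handle:
        # Iterating a binary file splits on b"\n", so this streams the file
        # instead of loading it into memory.
        line_count = sum(1 for _ in handle)
    return {
        "path": full_path,
        "is_file": True,
        "bytes": os.path.getsize(full_path),
        "lines": line_count,
    }
```

If is_file comes back False, or bytes and lines are both 0, the parser never had anything to read, which would be consistent with the "only 0 lines in file" message.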

If you are on Linux, inspect the file with head at the command line, then fix it with awk or sed. On Windows, you could try Vim to inspect and fix it. In short, pandas is probably not the best place to fix the file. You most likely have odd line endings (since the error message says 0 lines), so you need head, cat, or Vim to determine the line endings before you can decide how best to fix or handle them.
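If you would rather stay in Python than use head or Vim, a rough equivalent is to read the first chunk of the file as raw bytes and count each kind of line ending. This is only a sketch: count_line_endings and its sample_bytes parameter are names invented here, not anything pandas provides.

```python
def count_line_endings(path, sample_bytes=65536):
    """Count CRLF, bare LF, and bare CR endings in the first chunk of a file."""
    with open(path, "rb") as handle:
        chunk = handle.read(sample_bytes)
    crlf = chunk.count(b"\r\n")
    # Every CRLF contains one LF and one CR, so subtract them to get the bare ones.
    # A CRLF pair cut in half at the end of the sample is counted as a bare CR.
    lf = chunk.count(b"\n") - crlf
    cr = chunk.count(b"\r") - crlf
    return {"crlf": crlf, "lf": lf, "cr": cr, "head": chunk[:200]}
```

Printing the head entry shows the raw bytes much like head would, with \r and \n visible as escapes. A file where only cr is non-zero uses old Mac-style endings, which some parsers see as a single line.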

I ran into the same issue:


/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.13.1_601_g4663353-py2.7-macosx-10.9-x86_64.egg/pandas/io/parsers.pyc in __init__(self, src, **kwds)
    970         kwds['allow_leading_cols'] = self.index_col is not False
    971
--> 972         self._reader = _parser.TextReader(src, **kwds)
    973
    974         # XXX

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.13.1_601_g4663353-py2.7-macosx-10.9-x86_64.egg/pandas/parser.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4628)()

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.13.1_601_g4663353-py2.7-macosx-10.9-x86_64.egg/pandas/parser.so in pandas.parser.TextReader._get_header (pandas/parser.c:6068)()

CParserError: Passed header=0 but only 0 lines in file


My code is:

df = pd.read_csv('/Users/steven/Documents/Mywork/Python/sklearn/beer/data')

Finally, I found my mistake: I had passed the path of a directory, not a file, to read_csv.

The correct code is:

df = pd.read_csv('/Users/steven/Documents/Mywork/Python/sklearn/beer/data/beer_reviews.csv')

This runs correctly.

So I think the cause of your issue lies in the path you passed. Maybe it is the path of a directory, as in my case. Or maybe the file is empty, corrupt, or in the wrong encoding.
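The possibilities listed above can be checked in order with a small helper. This is a sketch under assumptions, not a definitive test: the name diagnose_csv_path, the returned strings, and the default list of candidate encodings are all made up here, and a successful decode only means the bytes are valid in that encoding, not that it is the right one.

```python
import codecs
import os


def diagnose_csv_path(path, encodings=("utf-8", "cp1252")):
    """Return a short guess at why a path might fail to parse as CSV."""
    if os.path.isdir(path):
        return "directory"
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path, "rb") as handle:
        sample = handle.read(4096)
    for encoding in encodings:
        # An incremental decoder tolerates a multi-byte character that was
        # cut off at the end of the sample.
        decoder = codecs.getincrementaldecoder(encoding)()
        try:
            decoder.decode(sample, final=False)
        except UnicodeDecodeError:
            continue
        return "readable as " + encoding
    return "undecodable"
```

If this reports something like "readable as cp1252", passing that name as the encoding argument of pd.read_csv is a reasonable next thing to try.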

I hope this helps.
