Reading CSV file into Pandas from SFTP server via Paramiko fails with “'utf-8' codec can't decode byte … in position …: invalid start byte”

I'm trying to read a CSV file into Pandas from an SFTP server using Paramiko:

with sftp.open(path + file.filename) as fp:
    fp_aux = pd.read_csv(fp, sep='|')

But when I attempt it, it throws this error:

'utf-8' codec can't decode byte 0xa3 in position 73: invalid start byte

I've tried passing different values to the encoding argument of pd.read_csv (unicode_escape, latin-1, latin1, latin, utf-8...). I have also tried engine='python', but no luck so far. Is there anything else I can try? If not, how can I ignore the error and continue to the next line or the next df?

This happens only when I read from the SFTP server; it works fine when I read the file from my local disk.

Complete call stack of the error:

UnicodeDecodeError                        Traceback (most recent call last)
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas\_libs\parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 83: invalid start byte

During handling of the above exception, another exception occurred:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-41-53537b824736> in <module>
      1 with sftp.open(r'/Debtopdcarich/Mandatory File/MandatoryFile_190721.csv') as fp:
----> 2     fp_aux = (pd.read_csv(fp, encoding='iso-8859-1', sep='|'))

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    603     kwds.update(kwds_defaults)
    604 
--> 605     return _read(filepath_or_buffer, kwds)
    606 
    607 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    461 
    462     with parser:
--> 463         return parser.read(nrows)
    464 
    465 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1050     def read(self, nrows=None):
   1051         nrows = validate_integer("nrows", nrows)
-> 1052         index, columns, col_dict = self._engine.read(nrows)
   1053 
   1054         if index is None:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   2054     def read(self, nrows=None):
   2055         try:
-> 2056             data = self._reader.read(nrows)
   2057         except StopIteration:
   2058             if self._first_chunk:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas\_libs\parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 83: invalid start byte

Pandas seems to be confused by the Paramiko file-like object API: when given a Paramiko file-like object, it does not apply its encoding argument.
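To illustrate what the error message means, here is a minimal sketch that decodes the offending byte on its own. It assumes the file is in a single-byte encoding such as ISO-8859-1 (the encoding tried in the traceback above); the variable names are only for illustration.

```python
# Byte 0xa3 cannot start a UTF-8 sequence, but it is a valid
# character in ISO-8859-1 (Latin-1).
raw = b"\xa3"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # same error class as in the traceback

# Decoding with the encoding passed to read_csv succeeds.
latin1_text = raw.decode("iso-8859-1")
print(utf8_ok, latin1_text)
```

So the traceback is consistent with the data being decoded as UTF-8 even though encoding='iso-8859-1' was passed.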

A quick-and-dirty solution is to read the remote file into an in-memory file-like object and pass that to Pandas. The encoding argument is then honored.

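from io import BytesIO

flo = BytesIO()
# Download the whole remote file into the in-memory buffer
sftp.getfo(path + file.filename, flo)
flo.seek(0)  # rewind so Pandas reads from the start
pd.read_csv(flo, sep='|', encoding='iso-8859-1')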

A more efficient approach might be to build a wrapper class on top of the Paramiko file-like object, exposing an API that Pandas can work with.
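One possible shape for such a wrapper is sketched below. It is not tested against Paramiko or Pandas here: RawReadAdapter, open_as_text and FakeRemoteFile are hypothetical names, and the sketch only assumes that the remote file object offers read(size) returning bytes. The adapter exposes that object as a standard io raw stream, and io.TextIOWrapper does the decoding, so the consumer receives text and never has to decode anything itself. The demonstration uses the standard csv module and a stand-in object instead of a real SFTP connection.

```python
import csv
import io


class RawReadAdapter(io.RawIOBase):
    """Expose any object with read(size) -> bytes as an io raw stream.

    Hypothetical helper; not part of Paramiko or Pandas.
    """

    def __init__(self, source):
        super().__init__()
        self._source = source

    def readable(self):
        return True

    def readinto(self, buffer):
        # Ask the underlying object for at most len(buffer) bytes.
        chunk = self._source.read(len(buffer))
        buffer[:len(chunk)] = chunk
        return len(chunk)  # 0 signals end of file


def open_as_text(source, encoding, buffer_size=32768):
    """Wrap a binary file-like object in a decoding text stream."""
    buffered = io.BufferedReader(RawReadAdapter(source), buffer_size=buffer_size)
    return io.TextIOWrapper(buffered, encoding=encoding, newline="")


class FakeRemoteFile:
    """Stand-in for the remote file: offers only read(size)."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, size):
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk


# Sample data containing byte 0xa3 (the pound sign in ISO-8859-1).
sample = "name|amount\nAlice|\u00a310\n".encode("iso-8859-1")
text_stream = open_as_text(FakeRemoteFile(sample), encoding="iso-8859-1")
rows = list(csv.reader(text_stream, delimiter="|"))
print(rows)
```

With a real connection, the idea would be to pass open_as_text(fp, 'iso-8859-1') to pd.read_csv inside the with sftp.open(...) block, so the file is streamed and decoded in chunks instead of being downloaded into memory first. Whether this is actually faster depends on how Paramiko buffers reads, so treat it as a starting point.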
