繁体   English   中英

当 python pandas.read_csv 在 azure 上时,编码没有改变

[英]When python pandas.read_csv on azure, encoding is not changing

通过使用 python pandas 读取 csv 文件,并尝试更改编码,由于一些德国字母,接缝 Azure 始终保持相同的编码(假设默认)。

无论我做了什么,总是在 Azure 门户上遇到相同的错误: “utf-8”编解码器无法解码位置 0 的字节 0xc4:无效的连续字节堆栈

即使我设置了 uft-16、latin1、cp1252 等,也会出现同样的错误。

with pysftp.Connection(host, username=username, password=password, cnopts=cnopts) as sftp:
  for i in sftp.listdir_attr():
     with sftp.open(i.filename) as f:
        df = pd.read_csv(f, delimiter=';', encoding='cp1252')

顺便说一下,在 Windows 机器上本地测试它,它工作正常。

完整错误:

Result: Failure Exception: UnicodeDecodeError: 'utf-8' codec cant decode byte 0xc4 in position 0: invalid continuation byte Stack: File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", 
line 355, in _handle__invocation_request call_result = await self._loop.run_in_executor( 
File "/usr/local/lib/python3.8/concurrent/futures/thread.py", 
line 57, in run result = self.fn(*self.args, **self.kwargs) File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", 
line 542, in __run_sync_func return func(**params) 
File "/home/site/wwwroot/ce_etl/etl_main.py", 
line 141, in main df = pd.read_csv(f, delimiter=';', encoding=r"utf-8-sig", error_bad_lines=False) 
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/util/_decorators.py", 
line 311, in wrapper return func(*args, **kwargs) 
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", 
line 586, in read_csv return _read(filepath_or_buffer, kwds) 
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", 
line 488, in _read return parser.read(nrows) 
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", 
line 1047, in read index, columns, col_dict = self._engine.read(nrows) 
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/c_parser_wrapper.py", 
line 223, in read chunks = self._reader.read_low_memory(nrows) 
File "pandas/_libs/parsers.pyx", 
line 801, in pandas._libs.parsers.TextReader.read_low_memory 
File "pandas/_libs/parsers.pyx", 
line 880, in pandas._libs.parsers.TextReader._read_rows 
File "pandas/_libs/parsers.pyx", 
line 1026, in pandas._libs.parsers.TextReader._convert_column_data 
File "pandas/_libs/parsers.pyx", 
line 1080, in pandas._libs.parsers.TextReader._convert_tokens 
File "pandas/_libs/parsers.pyx", 
line 1204, in pandas._libs.parsers.TextReader._convert_with_dtype 
File "pandas/_libs/parsers.pyx", 
line 1217, in pandas._libs.parsers.TextReader._string_convert 
File "pandas/_libs/parsers.pyx", 
line 1396, in pandas._libs.parsers._string_box_utf8

您可以使用如下编码:

read_csv('file', encoding = "ISO-8859-1")

另外,如果我们想检测文件自己的编码并将其放入 read_csv 中,我们可以将其添加如下:

result = chardet.detect(f.read()) #or readline if the file is large
df=pd.read_csv(r'C:\test.csv',encoding=result['encoding'])

请参阅 Python Pandas 文档中的read_csv

我找到了解决方案。 基本上 sftp.open 默认保持 utf-8。 为什么 Azure Linux 无法更改 read_csv 方法中的编码仍然是一个问题。

使用 sftp.getfo 作为对象读取文件,然后解析为 df 可以正常工作:

 ba = io.BytesIO()
 sftp.getfo(i.filename, ba)
 ba.seek(0)

 f = io.TextIOWrapper(ba, encoding='cp1252')
 df = pd.read_csv(f, delimiter=';', encoding='cp1252', dtype=str, 
                  error_bad_lines=False)

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM