Encoding doesn't change when using Python pandas.read_csv on Azure

Question

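I'm reading a CSV file with Python pandas and trying to change the encoding because the file contains some German letters, but it seems Azure always keeps the same encoding (presumably the default).

Whatever I do, I always get the same error in the Azure portal: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

Even if I set utf-16, latin1, cp1252, etc., the same error occurs.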

with pysftp.Connection(host, username=username, password=password, cnopts=cnopts) as sftp:
    for i in sftp.listdir_attr():
        with sftp.open(i.filename) as f:
            df = pd.read_csv(f, delimiter=';', encoding='cp1252')

By the way, it works fine when I test it locally on a Windows machine.

Full error:

Result: Failure
Exception: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
Stack:
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 355, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 542, in __run_sync_func
    return func(**params)
  File "/home/site/wwwroot/ce_etl/etl_main.py", line 141, in main
    df = pd.read_csv(f, delimiter=';', encoding=r"utf-8-sig", error_bad_lines=False)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 488, in _read
    return parser.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 223, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 880, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1080, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1204, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1217, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1396, in pandas._libs.parsers._string_box_utf8
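
For context, the byte 0xc4 named in the error is the letter "Ä" in cp1252 and latin1, but it is not valid on its own in UTF-8. The traceback ends in `_string_box_utf8`, which suggests the bytes reached the parser undecoded and were then treated as UTF-8. The snippet below is only a small illustration of that byte-level difference; the sample word "Ärger" is an assumption chosen for the demo, not data from the original file.

```python
raw = b"\xc4rger"  # "Ärger" as it would be stored in a cp1252 (or latin1) file

# cp1252 maps each byte to one character, so 0xC4 becomes "Ä".
decoded = raw.decode("cp1252")

# UTF-8 treats 0xC4 as the first byte of a two-byte sequence and expects a
# continuation byte (0x80-0xBF) next; "r" (0x72) is not one, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(decoded)
print(utf8_error)
```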

Answer 1

You can specify the encoding like this:

pd.read_csv('file', encoding="ISO-8859-1")

Alternatively, if you want to detect the file's own encoding and pass it to read_csv, you can do it as follows:

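import chardet
with open(r'C:\test.csv', 'rb') as f:  # binary mode: chardet needs raw bytes
    result = chardet.detect(f.read())  # or f.readline() if the file is large
df = pd.read_csv(r'C:\test.csv', encoding=result['encoding'])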

See read_csv in the Python pandas documentation.
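
If installing chardet is not an option, one alternative is to try a short list of candidate encodings in order. The sketch below uses only the standard library; `decode_with_fallback` and the sample bytes are hypothetical names made up for illustration. Note that ISO-8859-1 (latin1) accepts every byte value, so it never raises and should come last, if at all, since it can silently map characters incorrectly.

```python
from typing import Iterable, Tuple


def decode_with_fallback(
    raw: bytes, encodings: Iterable[str] = ("utf-8", "cp1252")
) -> Tuple[str, str]:
    """Return (text, encoding_used) for the first encoding that decodes raw."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Sample bytes standing in for a downloaded CSV file (an assumption for the demo).
sample = "Ärger;Straße\n".encode("cp1252")
text, used = decode_with_fallback(sample)
print(used, text)
```

The decoded text can then be handed to pandas through `io.StringIO(text)`, so pandas never has to guess the encoding.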

Answer 2

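I found a solution. Basically, sftp.open keeps utf-8 by default. Why the encoding cannot be changed through the read_csv method on Azure Linux is still an open question.

Reading the file into an in-memory object with sftp.getfo and then parsing it into a DataFrame works fine: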

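import io

ba = io.BytesIO()
sftp.getfo(i.filename, ba)  # download the raw bytes into memory
ba.seek(0)                  # rewind before reading

# Decode explicitly so pandas receives text rather than raw bytes
f = io.TextIOWrapper(ba, encoding='cp1252')
df = pd.read_csv(f, delimiter=';', encoding='cp1252', dtype=str,
                 error_bad_lines=False)  # deprecated since pandas 1.3; newer versions use on_bad_lines='skip'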
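
The same idea can be tried without an SFTP server. The following is a minimal sketch, assuming pandas is installed; `read_csv_bytes` and the sample rows are hypothetical and only stand in for the buffer that sftp.getfo would fill.

```python
import io

import pandas as pd


def read_csv_bytes(raw: bytes, encoding: str = "cp1252") -> pd.DataFrame:
    """Parse semicolon-delimited CSV bytes, decoding them explicitly."""
    buffer = io.BytesIO(raw)  # stands in for the buffer filled by sftp.getfo
    text = io.TextIOWrapper(buffer, encoding=encoding)
    return pd.read_csv(text, delimiter=";", dtype=str)


# Sample data with German letters, encoded the way the remote file is assumed to be.
sample = "Name;Stadt\nJörg;Köln\nÄnne;Münster\n".encode("cp1252")
df = read_csv_bytes(sample)
print(df)
```

Because the bytes are decoded by `io.TextIOWrapper` before pandas sees them, the result does not depend on how pandas treats the file object returned by the SFTP library.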

2 answers

Answer 1
0 2021-11-09 09:19:00

Answer 2
0 Accepted 2021-11-16 14:03:45
