pandas.read_csv on Azure: encoding is not changing
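I am reading a CSV file with Python pandas and trying to change the encoding because the file contains some German letters, but it seems Azure always keeps the same encoding (presumably the default).
No matter what I do, I always get the same error in the Azure portal: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
Even when I set utf-16, latin1, cp1252, etc., the same error occurs.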
with pysftp.Connection(host, username=username, password=password, cnopts=cnopts) as sftp:
    for i in sftp.listdir_attr():
        with sftp.open(i.filename) as f:
            df = pd.read_csv(f, delimiter=';', encoding='cp1252')
By the way, when I test it locally on a Windows machine, it works fine.
Full error:
Result: Failure
Exception: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
Stack:
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 355, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 542, in __run_sync_func
    return func(**params)
  File "/home/site/wwwroot/ce_etl/etl_main.py", line 141, in main
    df = pd.read_csv(f, delimiter=';', encoding=r"utf-8-sig", error_bad_lines=False)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 488, in _read
    return parser.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 223, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 880, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1080, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1204, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1217, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1396, in pandas._libs.parsers._string_box_utf8
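The error message itself can be reproduced without Azure, SFTP, or pandas, which helps show what the bytes look like. The following is a minimal sketch, assuming a made-up sample line (`Ärger;1`) that was saved as cp1252; the helper name `try_utf8` is hypothetical. In cp1252 the letter Ä is the single byte 0xC4, which UTF-8 reads as the start of a two-byte sequence and then rejects because the next byte is not a continuation byte.

```python
# Hypothetical sample line; "Ä" becomes the single byte 0xC4 in cp1252.
raw = "Ärger;1\n".encode("cp1252")


def try_utf8(data: bytes) -> str:
    """Return the decoded text, or the decoder's error message if UTF-8 fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)


utf8_result = try_utf8(raw)           # holds the UnicodeDecodeError message
cp1252_result = raw.decode("cp1252")  # decodes cleanly

print(utf8_result)
print(cp1252_result)
```

If the error names the `utf-8` codec even though another encoding was requested, the requested encoding was probably never applied to these bytes.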
You can specify the encoding like this:
pd.read_csv('file', encoding="ISO-8859-1")
Alternatively, if you want to detect the file's own encoding and pass it to read_csv, you can do it as follows:
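import chardet
with open(r'C:\test.csv', 'rb') as f:  # chardet needs raw bytes, so open in binary mode
    result = chardet.detect(f.read())  # or f.readline() if the file is large
df = pd.read_csv(r'C:\test.csv', encoding=result['encoding'])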
See read_csv in the Python pandas documentation.
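When the set of plausible encodings is small, an alternative to detection is to try them in order. This is only a sketch of that idea, not the chardet approach above: the function name `decode_with_fallback` and the candidate list are assumptions chosen for illustration, and it uses only the standard library.

```python
from typing import Iterable, Tuple


def decode_with_fallback(
    data: bytes,
    candidates: Iterable[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes without error."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next one
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical sample data with German letters, saved as cp1252.
text, used = decode_with_fallback("Größe;Ärger\n".encode("cp1252"))
print(used, text)
```

Order matters here: latin-1 accepts every byte sequence, so it belongs last, and a successful decode only means the bytes were valid in that encoding, not that the text is correct.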
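I found a solution. Basically, sftp.open sticks with utf-8 by default. Why the encoding passed to read_csv has no effect on Azure Linux is still an open question.
Reading the file into an in-memory object with sftp.getfo and then parsing it into a DataFrame works fine: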
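ba = io.BytesIO()
sftp.getfo(i.filename, ba)  # download the raw bytes into memory
ba.seek(0)                  # rewind before reading
f = io.TextIOWrapper(ba, encoding='cp1252')  # decode explicitly as cp1252
# error_bad_lines was deprecated in pandas 1.3; newer versions use on_bad_lines='skip'
df = pd.read_csv(f, delimiter=';', encoding='cp1252', dtype=str,
                 error_bad_lines=False)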
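The same pattern can be tried without an SFTP server. In this sketch an in-memory `io.BytesIO` filled with made-up cp1252 data stands in for what `sftp.getfo` would download, and the standard-library `csv.reader` stands in for `pd.read_csv` so that it runs with no third-party packages. The function name `read_semicolon_csv` and the sample rows are assumptions for illustration only.

```python
import csv
import io


def read_semicolon_csv(buffer: io.BytesIO, encoding: str = "cp1252") -> list:
    """Decode a bytes buffer explicitly, then parse it as ';'-delimited CSV."""
    buffer.seek(0)  # rewind, as after sftp.getfo
    # The wrapper does the decoding, so the parser only ever sees text.
    text = io.TextIOWrapper(buffer, encoding=encoding, newline="")
    return list(csv.reader(text, delimiter=";"))


# Hypothetical file contents standing in for the downloaded bytes.
ba = io.BytesIO("Name;Stadt\r\nÄrger;Köln\r\n".encode("cp1252"))
rows = read_semicolon_csv(ba)
print(rows)
```

Because the decoding happens in `io.TextIOWrapper`, the result does not depend on how the parser treats the `encoding` argument for an already-open file object.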