Python pandas.read_csv on Azure: encoding is not changing

Question

When I read a CSV file with Python pandas and try to change the encoding (the file contains some German letters), it seems Azure always keeps the same encoding (presumably the default).

Whatever I do, I always get the same error on the Azure portal: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

The same error appears even if I set utf-16, latin1, cp1252, etc.
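To see why the message names byte 0xc4: in cp1252 and latin1 that byte is 'Ä', while in UTF-8 it announces a two-byte sequence, so a plain ASCII letter right after it produces "invalid continuation byte". A minimal sketch with a made-up sample word (not taken from the actual file):

```python
# Hypothetical sample: a German word starting with 'Ä', encoded as cp1252.
raw = "Ärger".encode("cp1252")
first_byte = hex(raw[0])  # the byte value that the error message reports

# Decoding the same bytes as UTF-8 fails on the very first byte.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

# Decoding with the codec the bytes were written in succeeds.
decoded = raw.decode("cp1252")
print(first_byte, utf8_error, decoded)
```

So the traceback suggests that something is still decoding the data as UTF-8, regardless of the encoding passed to read_csv.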

import pandas as pd
import pysftp

with pysftp.Connection(host, username=username, password=password, cnopts=cnopts) as sftp:
    for i in sftp.listdir_attr():
        with sftp.open(i.filename) as f:
            df = pd.read_csv(f, delimiter=';', encoding='cp1252')

By the way, when I test this locally on a Windows machine, it works fine.

Full error:

Result: Failure
Exception: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
Stack:
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 355, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 542, in __run_sync_func
    return func(**params)
  File "/home/site/wwwroot/ce_etl/etl_main.py", line 141, in main
    df = pd.read_csv(f, delimiter=';', encoding=r"utf-8-sig", error_bad_lines=False)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 488, in _read
    return parser.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 223, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 880, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1080, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1204, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1217, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1396, in pandas._libs.parsers._string_box_utf8

Answer 1

You can pass the encoding as below:

df = pd.read_csv('file', encoding="ISO-8859-1")

Also, if you would like to detect the file's own encoding and pass it to read_csv, you can do it as below:

import chardet
with open(r'C:\test.csv', 'rb') as f:  # chardet needs the raw bytes
    result = chardet.detect(f.read())  # or f.readline() if the file is large
df = pd.read_csv(r'C:\test.csv', encoding=result['encoding'])

Refer to the read_csv page in the Python pandas documentation.
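If chardet is not available in the target environment, a rough stand-in is to try a short list of candidate encodings in order. This is only a sketch using the standard library: the helper name pick_encoding and the candidate list are assumptions, and because a single-byte codec such as cp1252 accepts almost any input, it should come last in the list.

```python
def pick_encoding(raw: bytes, candidates=("utf-8-sig", "cp1252")) -> str:
    """Return the first candidate encoding that decodes raw without errors."""
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this codec cannot decode the bytes, try the next one
        return enc
    raise ValueError("none of the candidate encodings could decode the data")

# Made-up sample bytes, standing in for the content of a downloaded file.
cp1252_guess = pick_encoding("Ärger".encode("cp1252"))
utf8_guess = pick_encoding("Ärger".encode("utf-8"))
print(cp1252_guess, utf8_guess)
```

The returned name can then be passed as the encoding argument of read_csv, in the same way as result['encoding'] from chardet.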

Answer 2

I found a solution. Basically, sftp.open keeps utf-8 by default. Why Azure Linux can't change the encoding in the read_csv method remains an open question.

Reading the file into an in-memory object with sftp.getfo and then parsing it into a DataFrame works fine:

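import io

# Download the raw bytes into memory instead of reading through sftp.open
ba = io.BytesIO()
sftp.getfo(i.filename, ba)
ba.seek(0)

# Decode the bytes explicitly as cp1252 before pandas sees them
f = io.TextIOWrapper(ba, encoding='cp1252')
# error_bad_lines is deprecated since pandas 1.3; newer versions use on_bad_lines='skip'
df = pd.read_csv(f, delimiter=';', encoding='cp1252', dtype=str,
                 error_bad_lines=False)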
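The same idea can be tried without an SFTP server. In this sketch the bytes in sample_bytes are made up and stand in for what sftp.getfo writes into the BytesIO, and read_cp1252_csv is a hypothetical helper name. It hands the binary buffer straight to read_csv and lets pandas apply the encoding, which assumes a reasonably recent pandas version:

```python
import io

import pandas as pd


def read_cp1252_csv(raw: bytes) -> pd.DataFrame:
    """Parse semicolon-separated CSV bytes that were written as cp1252."""
    buffer = io.BytesIO(raw)  # binary buffer, like the one filled by sftp.getfo
    return pd.read_csv(buffer, delimiter=';', encoding='cp1252', dtype=str)


# Made-up sample data containing German umlauts, encoded as cp1252.
sample_bytes = "Name;Stadt\nÄnne;Köln\n".encode("cp1252")
df = read_cp1252_csv(sample_bytes)
print(df)
```

Because the buffer holds raw bytes, nothing has decoded the data as UTF-8 before pandas reads it.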

solution1
0 2021-11-09 09:19:00

solution2
0 ACCEPTED 2021-11-16 14:03:45