How to detect the right file encoding with Python?

I am trying to read a CSV file (from https://openwrt.org/_media/toh_dump_tab_separated.zip ) in Python with pandas.read_csv(). The problem is the file's encoding: it is neither UTF-8 nor Latin1, and I don't want to try every codec listed at https://docs.python.org/3/library/codecs.html#standard-encodings by hand.

My current workaround is to open the file in LibreOffice, replace the weird characters with '-', save it as Latin1, and then open it in Python.

How do I do this in Python only?

The following code and error show where I am with UTF-8:

import pandas as pd
df = pd.read_csv('../../Downloads/toh_dump_tab_separated/ToH_dump_tab_separated.csv', encoding = 'utf-8')

(...)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 983: invalid start byte

and with Latin1:

import pandas as pd
df = pd.read_csv('../../Downloads/toh_dump_tab_separated/ToH_dump_tab_separated.csv', encoding = 'Latin1')

(...)

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2

Use the sep parameter. The file is tab-separated, so pass sep='\t' together with the encoding:

import pandas as pd
df = pd.read_csv('ToH_dump_tab_separated.csv', encoding = 'cp1252', sep='\t')
print(df)
          pid  ...                                           comments
0       16132  ...                                                NaN
1       16133  ...                                                NaN
2       16134  ...                                                NaN
3       16135  ...                           Clone of Aztech HW550-3G
4       16137  ...  Image build disabled in master with commit d7d...
...       ...  ...                                                ...
1759  9726386  ...                                                NaN
1760  9878711  ...  Rough edges as of December 2020. Realtek targe...
1761  9912125  ...  Works with WL-WN575A3 image according OpenWrt ...
1762  9927580  ...                                                NaN
1763  9946488  ...                                                NaN

[1764 rows x 67 columns]
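
The ParserError in the question comes from the default separator of read_csv, which is a comma, while the fields in this file are separated by tabs. As a minimal sketch of how the delimiter could be checked before loading, the standard library's csv.Sniffer can guess it from a text sample. The sample below is made up for illustration and is not taken from the real file.

```python
import csv

# Hypothetical sample standing in for the first lines of the decoded file.
sample = (
    "pid\tbrand\tcomments\n"
    "1\tAcme\tfirst, with a comma\n"
    "2\tBeta\tsecond\n"
)

# Restrict the guess to a few plausible delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters="\t,;")
print(repr(dialect.delimiter))
```

The detected value could then be passed as sep to pandas.read_csv. Sniffer is a heuristic and can fail on irregular data, so its answer is best treated as a hint.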

FYI, the weird byte 0xbf decodes to ¿ (Inverted Question Mark, U+00BF, or '\xbf' as a Python string escape):

print( df.switch[:2]); print( df.fccid[-2:])
0    Infineon ADM6996I
1                    ¿
Name: switch, dtype: object
1762                    http://¿
1763    https://fcc.io/Q87-03331
Name: fccid, dtype: object
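
The byte value explains the UTF-8 error. In UTF-8, 0xbf is only valid as a continuation byte, so it cannot start a character, while single-byte code pages such as cp1252 and Latin1 map it directly to U+00BF. A small sketch with a made-up byte string (not read from the real file):

```python
# Hypothetical raw bytes resembling the fccid value shown above.
raw = b"http://\xbf"

# UTF-8 rejects the lone 0xbf byte.
try:
    raw.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

# cp1252 maps every byte in this string to exactly one character.
text = raw.decode("cp1252")

print(reason)
print(text, hex(ord(text[-1])))
```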

Edit (thanks, Mark Tolonen): the encoding appears to be cp1252. There are smart quotes in some of the fields:

print( df.comments[254][288:])
 Ignore the “HW v” on the label - it may not say 2 for v2 hardware
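
Smart quotes are a useful hint: cp1252 stores them as bytes 0x93 and 0x94, which Latin1 decodes without error but as invisible control characters. Below is a rough sketch of trying candidate encodings in strict mode, in Python only. The helper name first_strict_decode and the candidate list are assumptions for illustration, and the sample bytes are made up. Latin1 goes last because it accepts any byte sequence.

```python
def first_strict_decode(raw, candidates=("utf-8", "cp1252", "latin1")):
    """Return (encoding, text) for the first candidate that decodes raw strictly.

    Hypothetical helper: a successful decode only shows the bytes are valid
    in that encoding, not that the encoding is the one the author used.
    """
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings decoded the data")


# Made-up bytes containing cp1252 smart quotes (0x93, 0x94) and 0xbf.
sample = "Ignore the \u201cHW v\u201d on the label \u00bf".encode("cp1252")

name, text = first_strict_decode(sample)
print(name, text)

# Latin1 also "works", but yields C1 control characters instead of quotes.
as_latin1 = sample.decode("latin1")
print("\x93" in as_latin1)
```

In practice the bytes would come from open(path, 'rb').read(), and the chosen name could then be passed as encoding to pandas.read_csv. The result still deserves a visual check, since several single-byte encodings can decode the same data without raising.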
