
Can I detect the text codec used in a string?

I'm reading a string from a file (which anybody can modify) and don't know which encoding the string uses. Is there a function like

 getCodec = mystring.getCodec()

which returns something like

 getCodec = 'utf-8' 

or

getCodec = 'ascii'

?

No, there is no such function, because files do not record which codec was used to write the text they contain.
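To illustrate why the bytes alone cannot settle the question, here is a minimal sketch using only the standard library. The variable names are made up for the example. It decodes the same four bytes under several codecs: two of them succeed and give completely different text, and nothing in the data says which one the writer intended.

```python
# The same four bytes, interpreted under three different codecs.
data = b"\xc7\xd1\xb1\xdb"

as_euc_kr = data.decode("euc-kr")   # two Korean syllables
as_latin1 = data.decode("latin1")   # four unrelated Latin-1 characters

# UTF-8 rejects these bytes, which rules out one candidate but
# still does not tell us which of the others is correct.
try:
    data.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(repr(as_euc_kr), repr(as_latin1), valid_utf8)
```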

If there is more context (for example a more specific format such as HTML or XML), then you can determine the codec, because the standard either specifies a default or allows the data to be annotated with its codec. Otherwise you are reduced to guessing based on the contents, which is what tools like chardet do.
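As a rough sketch of what "annotated with the codec" can look like in practice, the following reads the encoding named in an XML declaration. The function name `declared_xml_encoding` is hypothetical, and the approach assumes the file starts with the declaration in an ASCII-compatible encoding; it does not handle byte order marks or UTF-16 input, which a real XML parser deals with for you.

```python
import re
from typing import Optional

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding attribute, if one is present.
_XML_DECL = re.compile(rb"<\?xml[^>]*encoding=[\"']([A-Za-z0-9._-]+)[\"']")


def declared_xml_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None if absent."""
    match = _XML_DECL.match(data)
    if match is None:
        return None
    return match.group(1).decode("ascii")


print(declared_xml_encoding(b'<?xml version="1.0" encoding="EUC-KR"?><a/>'))
print(declared_xml_encoding(b"<a/>"))
```

When the function returns None, the XML specification's default (UTF-8, absent a byte order mark) would apply; for plain text files there is no comparable rule to fall back on.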

For a file that anyone can modify, your only real option is to document clearly which codec should be used.

You can use the third-party chardet module.

>>> import chardet
>>> chardet.detect(b'\xed\x95\x9c\xea\xb8\x80')  # u'한글'.encode('utf-8')
{'confidence': 0.7525, 'encoding': 'utf-8'}
>>> chardet.detect(b'\xc7\xd1\xb1\xdb')  # u'한글'.encode('euc-kr')
{'confidence': 0.99, 'encoding': 'EUC-KR'}

NOTE: chardet is not foolproof, and it can easily guess wrong if the file is small.

If you cannot use chardet and have no chance of specifying the encoding in advance, your only remaining recourse is to guess. You could do something like this:

# Candidate codecs, tried in order. Add whichever you want to the list,
# but end it with a codec like latin1 that never fails to decode.
codecs = ["utf-8", "euc-kr", "shift-jis", "latin1"]

def try_decode(text):
    # text must be a bytes object; return the first decoding that succeeds
    for codec in codecs:
        try:
            return text.decode(codec)
        except UnicodeError:
            continue
