
Python failing to encode strange characters

I am trying to parse a text file larger than 2 GB by running the following script:

#!/usr/bin/env python
import json

def convert2json(filename):
    with open(filename) as I:
        for line in I:
            d = {"data": line}
            print(json.dumps(d, ensure_ascii=False))

if __name__ == "__main__":
    import sys

    convert2json(sys.argv[1])

The script throws this error:

Traceback (most recent call last):
  File "ori.py", line 13, in <module>
    convert2json(sys.argv[1])
  File "ori.py", line 8, in convert2json
    print(json.dumps(d))
  File "/usr/lib/python2.7/json/__init__.py", line 244, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 31: invalid continuation byte

and I believe it fails when processing special characters such as these:

<E0>ทำคาม:
:<E9>皇甫
:<E9>皇甫:<E9>皇甫:<E9>皇甫

How can I make the script just ignore the lines that are causing problems?

When I go to the file that I am parsing, copy a big chunk of the lines that cannot be processed into a new file, and run the script again, it works. I do the copying by taking lines from less and pasting them into a file with vi. Am I doing something to the encoding itself when copying the lines?

OK, you are using Python 2, so what you read from the file is a byte string. Moreover, according to the error message, you are using the default value of the ensure_ascii parameter, which is true. In that case, all byte strings are decoded with the default encoding (utf8). If your input is not UTF-8 encoded, you get a UnicodeDecodeError.
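To make the failure concrete, here is a minimal sketch of the decoding step that goes wrong. It is written for Python 3 (where bytes and text are separate types) instead of the Python 2 used in the question, and the helper name describe_utf8_error and the sample bytes are made up for illustration.

```python
def describe_utf8_error(raw):
    """Return None if raw is valid UTF-8, else a short description of the error."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the offending byte,
        # exc.reason is the message produced by the codec.
        return "byte 0x%02x at position %d: %s" % (raw[exc.start], exc.start, exc.reason)
    return None


# 0xe0 announces a three-byte UTF-8 sequence, so the next byte must be a
# continuation byte. Here it is followed by a plain ASCII letter instead.
print(describe_utf8_error(b"abc\xe0def"))

# The same character, properly encoded as UTF-8, decodes without error.
print(describe_utf8_error(u"abc\u00e0def".encode("utf-8")))
```

The first call reports the same kind of message as the traceback in the question; the second returns None because the bytes are valid UTF-8.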

What can be done?

If you are not sure of the initial encoding and just want to leave everything as is, you can simply declare a Latin1 encoding. It maps every byte to the Unicode character that has the same code. Setting ensure_ascii to false is a bit different: it simply allows any byte in the resulting JSON string, which may lead to non-portable JSON:

The RFC does not explicitly forbid JSON strings which contain byte sequences that don't correspond to valid Unicode characters (e.g. unpaired UTF-16 surrogates), but it does note that they may cause interoperability problems. By default, this module accepts and outputs (when present in the original str) code points for such sequences.

So this is a bulletproof way:

def convert2json(filename):
    with open(filename) as I:
        for line in I:
            d = {"data": line}
            print(json.dumps(d, encoding='Latin1'))

A non-ASCII byte in the input file, say '\x..', is then simply encoded in the JSON as '\u00..'.
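The encoding keyword used above exists only in Python 2's json.dumps. As a rough Python 3 equivalent of the same idea, one can decode each raw line as Latin-1 before serializing; the sketch below assumes that approach, and the function name line_to_json_latin1 is hypothetical.

```python
import json


def line_to_json_latin1(raw_line):
    """Decode bytes as Latin-1 (this cannot fail) and wrap the text in a JSON object."""
    # Every byte value 0x00-0xff maps to the Unicode code point with the same number.
    text = raw_line.decode("latin-1")
    # ensure_ascii is left at its default, so non-ASCII characters become \u00.. escapes.
    return json.dumps({"data": text})


print(line_to_json_latin1(b"\xe0 caf\xe9\n"))
```

Note that if the file is in fact mostly UTF-8, multi-byte characters come out as several separate Latin-1 characters. The original bytes can still be recovered by encoding the decoded string back with Latin-1.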

Well, to skip those lines you can use this:

#!/usr/bin/env python
import json

def convert2json(filename):
    with open(filename) as I:
        for line in I:
            try:
                d = {"data": line}
                print(json.dumps(d, ensure_ascii=False))
            except UnicodeDecodeError:
                # skip only lines that cannot be decoded; other errors still surface
                continue

if __name__ == "__main__":
    import sys

    convert2json(sys.argv[1])  

I have wrapped the code inside the loop in a try/except block.
This way, when an error occurs, it is suppressed. You will see no output for the current line, and the script will continue to the next one.
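For readers on Python 3, the same skip-on-error idea might look like the sketch below. It assumes the file is meant to be UTF-8 and reads it in binary mode so that the decoding error is raised inside the try block and not by the file iterator; the name iter_json_lines is made up for this example.

```python
import json


def iter_json_lines(path):
    """Yield one JSON string per line, skipping lines that are not valid UTF-8."""
    with open(path, "rb") as handle:  # binary mode: each line is decoded below
        for raw_line in handle:
            try:
                text = raw_line.decode("utf-8")
            except UnicodeDecodeError:
                continue  # drop only the undecodable line and move on
            yield json.dumps({"data": text}, ensure_ascii=False)
```

It could be used as `for item in iter_json_lines(sys.argv[1]): print(item)`. Because it is a generator, the 2 GB file is never held in memory at once.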

However, my tests didn't throw an error when I tried the part of the file you provided. Could you tell us what the encoding of your file is? Are you sure that the problem is caused by those characters? Try adding a print() statement and a counter to your original code so you can identify exactly which line is invalid.
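One way to do that counting is sketched here, again for Python 3 and assuming the expected encoding is UTF-8. The function name find_undecodable_lines is hypothetical, not part of any library.

```python
def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, byte_offset, reason) for every line that fails to decode."""
    problems = []
    with open(path, "rb") as handle:  # read raw bytes so nothing is decoded implicitly
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the bad byte within this line
                problems.append((line_number, exc.start, exc.reason))
    return problems
```

Printing the returned tuples shows which lines to inspect and at which byte offset the decoder gave up, which should help confirm whether the characters listed in the question are really the cause.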
