Python file open() throws exception for non-UTF-8 character

I wrote the simplest Python program that exhibits the error I need help with.

lines_read = 0
urllist_file = open('../fall11_urls.txt', 'r')

for line in urllist_file:
    lines_read += 1
print('line count:', lines_read)

It works as expected on most files, but "fall11_urls.txt" is a 14-million-line text file containing URLs, one per line. Some of these lines apparently contain non-UTF-8 characters, and I get the error quoted below. I need to access every one of these URLs. What is the best way to handle this? These URLs can be "anything"; some are 400 characters of random characters, as in "https://bbswigr.fty.com/_Kcsnuk4J71A/RjzGhXZGmfI/AAAARg/xP3FO-Xbt68/s320/Axolo.jpg". Some of these strings contain bytes such as 0x96. I need my Python program to be robust against whatever might be in the file. (If it matters, this runs on Ubuntu 16.04.)

Here is the error:

Traceback (most recent call last):
  File "./count_lines.py", line 2, in <module>
    for line in urllist_file:
  File "/home/chris/.virtualenvs/cvml3/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 5529: invalid start byte
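
The same failure can be reproduced without the 14-million-line file. The following is a minimal sketch, assuming a made-up URL that contains the byte 0x96; the sample bytes are hypothetical and not taken from fall11_urls.txt.

```python
# Hypothetical sample line; only the 0x96 byte matters here.
sample = b"http://example.com/a\x96b.jpg\n"

error = None
try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    # exc.start is the offset of the offending byte within `sample`
    print("cannot decode byte", hex(sample[exc.start]),
          "at position", exc.start, "-", exc.reason)
```

0x96 has the bit pattern of a UTF-8 continuation byte, so it cannot begin a character, which is why the codec reports "invalid start byte".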

One more bit of information: iconv finds the same problem with the same file. See below.

$ iconv ../fall11_urls.txt >> /dev/null
iconv: illegal input sequence at position 1042953625

My current workaround is UGLY. I use iconv to find the problem, hand-edit the file in vi, then process it, and keep doing this until the file is clean. But I have MILLIONS of lines in many files to process. And the URLs mostly do work after I hand-correct them, so these are not noise or "flipped bits".
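
The find-and-fix loop described above could be replaced by a single pass that lists every offending line at once. This is only a sketch: `find_bad_lines` is a hypothetical helper name, and the demo file contents are invented for illustration. Reading in binary and decoding line by line is safe because the newline byte never occurs inside a multi-byte UTF-8 sequence.

```python
import os
import tempfile

def find_bad_lines(path):
    """Yield (line_number, byte_offset, bad_byte) for lines that are not valid UTF-8."""
    offset = 0  # byte offset of the start of the current line
    with open(path, "rb") as f:
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # Only the first bad byte of each line is reported.
                yield line_number, offset + exc.start, raw[exc.start]
            offset += len(raw)

# Demo on a small throwaway file (hypothetical contents).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"http://example.com/ok\nhttp://example.com/a\x96b\n")
demo_path = tmp.name

bad = list(find_bad_lines(demo_path))
os.remove(demo_path)

for line_number, byte_offset, bad_byte in bad:
    print("line", line_number, "offset", byte_offset, "byte", hex(bad_byte))
```

The byte offsets are counted from the start of the file, so they should be comparable to the positions iconv reports.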

Answering my own question to let you all know what worked. Yes, opening in binary worked; I tried it, but then I don't have a "text" file. I read up on encodings and found that the following works, because every byte value is valid in Latin-1. It is the safest thing to do.

urllist_file = open('../fall11_urls.txt', 'r', encoding="latin-1")

It seems that anyone opening text files obtained from other people, with no way to control or know in advance what is inside, might be advised to use "latin-1", because there are no invalid byte values in Latin-1.
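
The claim that Latin-1 has no invalid byte values can be checked directly. The sketch below also shows one caveat: decoding without an exception does not mean the characters are the ones the file's author intended. That the stray 0x96 bytes came from a Windows-1252 source is an assumption here; the post does not say.

```python
# Every one of the 256 byte values decodes under Latin-1, and encoding
# the result gives back the original bytes unchanged.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")
round_trip_ok = text.encode("latin-1") == all_bytes
print("round trip preserved all bytes:", round_trip_ok)

# Caveat: Latin-1 maps 0x96 to the control character U+0096, whereas
# Windows-1252 maps the same byte to an en dash (U+2013).
as_latin1 = b"\x96".decode("latin-1")
as_cp1252 = b"\x96".decode("cp1252")
print(ascii(as_latin1), ascii(as_cp1252))
```

Because the round trip is lossless, a URL read this way can be turned back into its original bytes with `.encode("latin-1")` before it is used.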

Thanks. The suggestion to open in binary got me to investigate what other parameters open() accepts. I'm new to Python and was astounded to find that strings are not just a list of bytes. (That is what 20+ years of working in C will condition you to expect.)

Did you try opening the file in binary mode? This should work: urllist_file = open('../fall11_urls.txt', 'rb'), then convert to whatever format you want.
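
One way the "convert to whatever format you want" step might look is sketched below. The helper names `decode_url` and `read_urls` are hypothetical, and the choice of Windows-1252 as the fallback codec is an assumption, not something the thread establishes.

```python
def decode_url(raw, fallback="cp1252"):
    """Decode one raw line: UTF-8 first, then a fallback codec, never raising."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" covers the few byte values cp1252 leaves undefined
        return raw.decode(fallback, errors="replace")

def read_urls(path):
    """Yield each line of the file as a str, with the line ending removed."""
    with open(path, "rb") as f:
        for raw in f:
            yield decode_url(raw.rstrip(b"\r\n"))

# Small checks on hypothetical byte strings.
valid_utf8 = decode_url(b"caf\xc3\xa9")   # valid UTF-8, decoded as such
with_0x96 = decode_url(b"a\x96b")         # falls back to cp1252
undefined = decode_url(b"\x81")           # undefined in cp1252, replaced
print(ascii(valid_utf8), ascii(with_0x96), ascii(undefined))
```

Unlike opening the whole file as Latin-1, this keeps correctly encoded UTF-8 lines intact and only reinterprets the lines that fail to decode.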
