Python file open() raises an exception on non-UTF-8 characters

Question

I wrote the simplest Python program that exhibits the error I need help with.

lines_read = 0
urllist_file = open('../fall11_urls.txt', 'r')

for line in urllist_file:
    lines_read += 1
print('line count:', lines_read)

This works as expected on most files, but "fall11_urls.txt" is a 14-million-line text file that contains URLs, one per line. Some of these lines contain what are apparently non-UTF-8 characters, and I get the error quoted below. I need to access every one of these URLs. What is the best way to handle this? These URLs can be "anything"; some are 400 random characters, as in https://bbswigr.fty.com/_Kcsnuk4J71A/RjzGhXZGmfI/AAAARg/xP3FO-Xbt68/s320/Axolo.jpg and some of these strings contain bytes such as 0x96. I need my Python program to be robust against whatever might be in the file. (If it matters, this runs on Ubuntu 16.04.)

Here is the error:

Traceback (most recent call last):
  File "./count_lines.py", line 2, in <module>
    for line in urllist_file:
  File "/home/chris/.virtualenvs/cvml3/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 5529: invalid start byte

One more bit of information: iconv finds the same problem with the same file. See below.

$ iconv ../fall11_urls.txt >> /dev/null
iconv: illegal input sequence at position 1042953625

My current workaround is UGLY. I use iconv to find the problem, hand-edit the file in vi, then process it, and keep doing this until the file is clean. But I have MILLIONS of lines in many files to process. The URLs mostly do work after I hand-correct them, so these bytes are not noise or "flipped bits".
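The find-then-edit loop described above could, in principle, be automated by reading the file in binary mode and attempting to decode each line. The following is only a minimal sketch: the function name `find_undecodable_lines` and the sample data are hypothetical, and a temporary file stands in for the real URL list.

```python
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, offset_in_line, raw_bytes) for each line that fails to decode."""
    bad = []
    with open(path, "rb") as f:  # binary mode: nothing is decoded while reading
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the index of the first offending byte in this line
                bad.append((line_number, exc.start, raw))
    return bad


# Hypothetical sample data standing in for the real URL list.
sample = (
    b"http://example.com/a.jpg\n"
    b"http://example.com/b\x96c.jpg\n"
    b"http://example.com/d.jpg\n"
)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

bad_lines = find_undecodable_lines(tmp.name)
os.remove(tmp.name)

for line_number, offset, raw in bad_lines:
    print("line", line_number, "offset", offset, raw)
```

This reports every problem line in one pass instead of one position per iconv run.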

Answer 1

Answering my own question to let you all know what worked. Yes, opening in binary worked; I tried it, but then I don't have a "text" file. I read up on encodings and found that the following works, because every byte value is valid in this encoding. It is the safest thing to do.

urllist_file = open('../fall11_urls.txt', 'r', encoding="latin-1")

It seems that anyone opening text files received from other people, with no way to control or know in advance what is inside, might be advised to use "latin-1", because there are no invalid byte values in Latin-1.
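The "no invalid byte values" property can be checked directly. The sketch below uses a made-up URL (not one from the file) and also shows a caveat: Latin-1 never raises, but that does not guarantee the decoded character is the one the file's author intended. For example, byte 0x96 becomes a control character under Latin-1 but an en dash under Windows-1252; which encoding the file really uses is unknown here.

```python
raw = b"http://example.com/b\x96c.jpg"  # hypothetical URL containing byte 0x96

# Latin-1 maps each byte 0x00-0xFF to the code point with the same number,
# so decoding cannot fail and encoding restores the original bytes.
text = raw.decode("latin-1")
assert text.encode("latin-1") == raw

# All 256 byte values survive the round trip.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# The same byte yields different characters under different codecs.
as_latin1 = b"\x96".decode("latin-1")  # U+0096, a C1 control character
as_cp1252 = b"\x96".decode("cp1252")   # U+2013, EN DASH
print(ascii(as_latin1), ascii(as_cp1252))
```

Because the round trip is lossless, the original bytes can always be recovered with `text.encode("latin-1")` if a different interpretation turns out to be needed later.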

Thanks. The suggestion to open in binary got me to investigate what other parameters open() accepts. I'm new to Python and was astounded to find that strings are not just a list of bytes. (That is what 20+ years of working in C will condition you to expect.)
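Among the other parameters open() accepts is `errors`, which controls what happens when a byte cannot be decoded. The sketch below compares two of the standard handlers on a made-up sample written to a temporary file; it is an alternative to consider, not what was used above.

```python
import os
import tempfile

# Hypothetical sample line containing the problematic byte 0x96.
sample = b"http://example.com/b\x96c.jpg\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

# errors="replace": each undecodable byte becomes U+FFFD; the original byte is lost.
with open(tmp.name, "r", encoding="utf-8", errors="replace") as f:
    replaced = f.read()

# errors="surrogateescape": each undecodable byte becomes a lone surrogate
# that encodes back to the same byte with the same handler.
with open(tmp.name, "r", encoding="utf-8", errors="surrogateescape") as f:
    escaped = f.read()

os.remove(tmp.name)

restored = escaped.encode("utf-8", errors="surrogateescape")
print(ascii(replaced))
print(restored == sample)
```

With "surrogateescape" the valid UTF-8 parts of the file are decoded as UTF-8 and the stray bytes are still recoverable, whereas "replace" is only suitable when the exact bytes no longer matter.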

Answer 2

Did you try the crude method? This should work:

urllist_file = open('../fall11_urls.txt', 'rb')

Then convert to whatever format you want.
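One way the "convert" step might look is to decode each raw line as UTF-8 and fall back to another encoding only when that fails. This is a sketch under assumptions: the helper names `decode_line` and `read_urls`, the Latin-1 fallback, and the sample data are all illustrative choices, not something taken from the file in question.

```python
import os
import tempfile


def decode_line(raw, fallback="latin-1"):
    """Decode as UTF-8 when possible, otherwise use the fallback (an assumed policy)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


def read_urls(path):
    """Yield one decoded URL per line, reading the file as raw bytes."""
    with open(path, "rb") as f:
        for raw in f:
            yield decode_line(raw.rstrip(b"\r\n"))


# Hypothetical sample: one valid UTF-8 line and one line with a stray 0x96 byte.
sample = b"http://example.com/caf\xc3\xa9.jpg\nhttp://example.com/b\x96c.jpg\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

urls = list(read_urls(tmp.name))
os.remove(tmp.name)

print(len(urls))
```

Compared with opening the whole file as Latin-1, this keeps correctly encoded UTF-8 lines intact and applies the fallback only to the lines that need it.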
