Unable to read lines from a .txt file with Python due to a decoding error

Question

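I want to read a .txt file line by line, but I get an error saying the 'gbk' codec can't decode byte 0x9d in position 5195: illegal multibyte sequence.

I don't really understand this. Are there several ways to decode a .txt file, so that I need to specify something? Or should I convert the .txt file somehow?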

import urllib2

fname = urllib2.urlopen("https://www.gutenberg.org/files/1661/1661-0.txt")
for line in fname:
    print(line[0])

By the way, I also tried downloading the .txt file and opening it from my local drive. Same problem. Has anyone seen this before?

Answer 1

If you are using Python 3, use this:

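import urllib.request  # urllib2 does not exist in Python 3

response = urllib.request.urlopen("https://www.gutenberg.org/files/1661/1661-0.txt")

# Iterating over the response yields bytes, so decode each line as UTF-8.
for line in response:
    print(line.decode('utf-8')[0])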

Or try the requests package:

import requests

response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt").text

Answer 2

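An alternative to the accepted answer:

Use `urllib.request` for Python 3

If you are using Python 3, do not use urllib2. Use the built-in urllib.request module instead (there is nothing to install).

See this note from the urllib2 documentation:

Note: The urllib2 module has been split across several modules in Python 3 named urllib.request and urllib.error.

To read the text of the book into a variable: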

import urllib.request

book_url = "https://www.gutenberg.org/files/1661/1661-0.txt"

response = urllib.request.urlopen(book_url)
book_text = response.read().decode('utf-8')

Or, to print the whole book to the terminal:

import urllib.request

book_url = "https://www.gutenberg.org/files/1661/1661-0.txt"

with urllib.request.urlopen(book_url) as f:
    print(f.read().decode('utf-8'))
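
The question asked for line-by-line reading rather than one large string. A minimal sketch, assuming the body is UTF-8: wrap the binary stream in io.TextIOWrapper so that iterating yields decoded text lines. The helper name iter_text_lines is hypothetical, and an in-memory io.BytesIO stands in for the HTTP response so the example runs offline.

```python
import io

def iter_text_lines(binary_stream, encoding="utf-8"):
    """Yield decoded text lines from a binary file-like object."""
    # TextIOWrapper decodes incrementally, so a multi-byte character that
    # straddles an internal read boundary is still decoded correctly.
    with io.TextIOWrapper(binary_stream, encoding=encoding) as text_stream:
        for line in text_stream:
            yield line.rstrip("\n")

# Stand-in for the HTTP response body (an assumption, for offline use).
sample = io.BytesIO("Lestrade laughed.\n“I am afraid,” he said.\n".encode("utf-8"))
lines = list(iter_text_lines(sample))
print(lines)
```

With a real download, the object returned by urllib.request.urlopen(book_url) should be usable in place of sample, since it is also a binary file-like object.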

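The `requests` package

As the accepted answer notes, you can install and use the requests package for a higher-level HTTP interface. However, it still requires explicit handling of the encoding: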

import requests

book_url = "https://www.gutenberg.org/files/1661/1661-0.txt"
r = requests.get(book_url)
r.encoding = 'utf-8'
response = r.text
print(response)

Without the explicit instruction to use UTF-8, the result may mishandle some characters, such as what Microsoft calls smart (curly) quotes. You may get something like this...

Lestrade laughed. âI am afraid that I am still a sceptic,â he said.

...when you should get this:

Lestrade laughed. “I am afraid that I am still a sceptic,” he said.
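
The garbled output can be reproduced without any network access. As far as I understand its documented behavior, requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset, so the three UTF-8 bytes of a curly quote are read as three separate characters. The sketch below shows the same bytes decoded three ways; the variable names are only illustrative.

```python
quote = "“"  # U+201C LEFT DOUBLE QUOTATION MARK

# UTF-8 stores this single character as three bytes.
raw = quote.encode("utf-8")
print(raw)

# Decoding those bytes with the wrong codec produces mojibake.
as_latin1 = raw.decode("latin-1")   # 'â' followed by two invisible control characters
as_cp1252 = raw.decode("cp1252")    # 'â€œ', the form a browser may display
print(repr(as_latin1))
print(as_cp1252)

# Decoding with the codec the text was written in recovers the quote.
print(raw.decode("utf-8"))
```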

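Why is explicit encoding needed?

The URL we are accessing points to a Project Gutenberg web page that displays the text of the novel. However, when I open this page in my browser, the data is displayed incorrectly. For example, I see this: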

Lestrade laughed. â€œI am afraid that I am still a sceptic,â€ he said.

At the top of the web page, we see the following:

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle

Release Date: November 29, 2002 [EBook #1661]
Last Updated: May 20, 2019

Language: English

Character set encoding: UTF-8

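So the body of the text tells us that the page is evidently encoded as UTF-8.

However, if we inspect the document (for example with Firefox's "Inspect Element" tool), we see: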

<head>
    <link rel="stylesheet" href="resource://content-accessible/plaintext.css">
</head>

There is no encoding specified, such as:

<meta charset="UTF-8">

So when we process the response text, we have to handle the encoding explicitly ourselves. This ensures that Python handles the data correctly.
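
One way to make that handling explicit is to use the charset from the Content-Type header when the server sends one, and fall back to UTF-8 otherwise. This is a sketch under that assumption; the helper names charset_from_content_type and decode_body are hypothetical, and the header strings are made-up examples rather than what the Gutenberg server actually returns.

```python
from email.message import Message

def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default

def decode_body(raw, content_type, default="utf-8"):
    """Decode raw bytes using the declared charset, else the default."""
    return raw.decode(charset_from_content_type(content_type, default))

body = "“I am still a sceptic,”".encode("utf-8")
print(decode_body(body, "text/plain"))  # no charset declared, so UTF-8 is used
print(charset_from_content_type("text/plain; charset=ISO-8859-1"))
```

With urllib.request, the response's headers object offers the same get_content_charset() method, so the header does not have to be parsed by hand there.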

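Once the data leaves Python (for example, if it is written to a file or displayed in a terminal), the consumer of that data (such as a file reader or terminal display) also needs to use the correct encoding when handling it.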
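
The same applies to the local-file case mentioned in the question. When no encoding argument is given, open() uses the platform's preferred encoding, which on a Chinese-locale Windows system is typically GBK; that likely explains the 'gbk' codec error. A minimal sketch, assuming the file is UTF-8; the file name and the sample text are placeholders.

```python
import os
import tempfile

book_text = "Lestrade laughed. “I am afraid that I am still a sceptic,” he said.\n"

# Placeholder location for the downloaded book.
path = os.path.join(tempfile.mkdtemp(), "book.txt")

# Write with an explicit encoding so the bytes on disk are UTF-8.
with open(path, "w", encoding="utf-8") as f:
    f.write(book_text)

# Read back line by line, again naming the encoding explicitly
# instead of relying on the platform default.
with open(path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print(lines[0])
```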


Answer 1: 2, accepted, 2020-03-20 00:40:41

Answer 2: 1, 2020-03-20 13:37:20
