
'utf-8' codec can't decode byte 0xf6 in position 139604: invalid start byte

I am working on a knowledge engineering project.

While crawling some scientists' personal sites, I ran into this error.

import html2text
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import urllib.request


homepage = "http://angom.myweb.cs.uwindsor.ca"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url=homepage, headers=headers)
print(req)
c = urlopen(req).read()  # raw response body as bytes
print(type(c))

# This line raises the UnicodeDecodeError: the page is not UTF-8 encoded.
content = urlopen(req).read().decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 139604: invalid start byte

The page declares its encoding in a meta tag in its header:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

So use that encoding when decoding the response bytes:

content = urlopen(req).read().decode("windows-1252")

This works for this particular page.
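Hard-coding "windows-1252" only helps for pages that happen to use it. As a rough sketch of a more general approach, the snippet below looks for a charset declaration near the start of the raw bytes and decodes with whatever it finds. The helper name `decode_page`, the 4096-byte search window, and the sample bytes are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Hypothetical helper (not from the original answer): find a charset
# declaration such as charset=windows-1252 in the raw page bytes.
_CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def decode_page(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw HTML bytes using the declared charset, if one is found."""
    match = _CHARSET_RE.search(raw[:4096])  # only inspect the start of the page
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect declaration: decode with the fallback and
        # substitute U+FFFD for undecodable bytes instead of raising.
        return raw.decode(fallback, errors="replace")


# Made-up sample: 0xf6 is "ö" in windows-1252 but an invalid start byte in UTF-8.
sample = (
    b'<meta http-equiv=Content-Type content="text/html; charset=windows-1252">'
    b"<p>Bj\xf6rn</p>"
)
print(decode_page(sample))
```

With the code from the question, this would be called as `decode_page(urlopen(req).read())`. If the server sends a charset in its Content-Type header, the response's `headers.get_content_charset()` is another place to look before falling back to the meta tag.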

If you are planning to use BeautifulSoup anyway, note that it already does a very good job of figuring out the encoding by itself when you hand it the raw bytes.
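A minimal sketch of that approach, assuming the `bs4` package is installed and using made-up sample bytes in place of the real page: pass the undecoded bytes to `BeautifulSoup` and let it pick the encoding. The `original_encoding` attribute reports what it settled on.

```python
from bs4 import BeautifulSoup

# Made-up sample standing in for urlopen(req).read(); the byte 0xf6 is
# "ö" in windows-1252 and would fail a strict UTF-8 decode.
raw = (
    b"<html><head>"
    b'<meta http-equiv=Content-Type content="text/html; charset=windows-1252">'
    b"<title>Bj\xf6rn</title></head>"
    b"<body><p>Home page</p></body></html>"
)

# Pass bytes, not str, so BeautifulSoup can detect the encoding itself.
soup = BeautifulSoup(raw, "html.parser")

print(soup.original_encoding)  # the encoding BeautifulSoup chose
print(soup.title.string)       # the decoded page title
```

Calling `.decode()` before handing the content to BeautifulSoup is therefore unnecessary, and it is exactly the step that raised the error in the question.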


