Why does b' (sometimes b'') show up when I split some HTML source code? [Python]

Question

I'm fairly new to Python and programming in general. I have done a few tutorials and am about 2/3 of the way through a pretty good book. That being said, I've been trying to get more comfortable with Python and programming by just trying things out in the std lib.

I have recently run into a weird quirk that I'm sure is the result of my own incorrect or un-"pythonic" use of the urllib module (with Python 3.2.2):

import urllib.request

HTML_source = urllib.request.urlopen('http://www.somelink.com').read()

print(HTML_source)

When this bit is run through the interactive interpreter it prints the HTML source of somelink, but prefixes it with b', for example:

b'<HTML>\r\n<HEAD> (etc). . . .

If I split the string into a list by whitespace, every item in the list is prefixed with b'.

I'm not really trying to accomplish anything specific, just trying to familiarize myself with the std lib. I would like to know why this b' is getting prefixed.

Also, a bonus question: is there a better way to get HTML source WITHOUT using a third-party module? I know all that jazz about not reinventing the wheel and whatnot, but I'm trying to learn by "building my own tools".

Thanks in advance!

Answer 1

The "b" prefix means that the type is bytes, not str. To convert the bytes into text, use the decode method and name the appropriate encoding. The encoding is often found in the "Content-Type" header:

>>> import urllib.request
>>> u = urllib.request.urlopen('http://cnn.com')
>>> u.getheader('Content-Type')
'text/html; charset=UTF-8'
>>> html = u.read().decode('utf-8')
>>> type(html)
<class 'str'>

If you don't find the encoding in the headers, try utf-8 as a default.
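The header lookup and the utf-8 fallback described above can be sketched in one place. This is only a minimal sketch: the helper names charset_from_content_type and fetch_text are made up for illustration, and it assumes the server's Content-Type header, when present, names the encoding correctly.

```python
from email.message import Message
import urllib.request


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or the default.

    Hypothetical helper: it reuses the stdlib header parser instead of
    splitting the string by hand.
    """
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None.
    return msg.get_content_charset() or default


def fetch_text(url, default="utf-8"):
    """Download a page and decode it to str (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        encoding = charset_from_content_type(
            response.getheader("Content-Type"), default
        )
        return response.read().decode(encoding)


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

Calling fetch_text('http://cnn.com') would then return a str rather than bytes, so printing or splitting it no longer shows the b' prefix.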

Answer 2

b'' is the literal notation for a bytes object. There are no b'' objects in memory, only bytes; the prefix is just how bytes objects are written in source code and shown when they are printed. Plain quotes '' in source code create str objects (Unicode strings).
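A small illustration of the difference, using a made-up byte string in place of a real download (the sample content is an assumption, not the page from the question):

```python
# A bytes literal, similar in shape to what urlopen(...).read() returns.
raw = b"<HTML>\r\n<HEAD> <TITLE>Hi</TITLE>"

print(type(raw))  # <class 'bytes'>

# Splitting bytes yields a list of bytes, so each item prints with b'.
parts = raw.split()
print(parts)  # [b'<HTML>', b'<HEAD>', b'<TITLE>Hi</TITLE>']

# Decoding first yields str, and the pieces print without the prefix.
text = raw.decode("ascii")
words = text.split()
print(words)  # ['<HTML>', '<HEAD>', '<TITLE>Hi</TITLE>']
```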

If a bytes object represents text (rather than binary data such as an image), then in general you should decode it to a Unicode string as soon as possible. To do that, you need to know the character encoding of the text.

HTML parsers such as lxml.html and BeautifulSoup may convert bytes to Unicode without your intervention.

If you don't know the encoding, it can be non-trivial to detect it; e.g., read how feedparser detects character encoding [2006].
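To give a feel for why detection is awkward, here is a deliberately naive stdlib-only fallback. It is not what feedparser does and is not real detection: the function name and the candidate list are assumptions chosen for illustration, and it simply returns the first candidate that decodes without error.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding used).

    Hypothetical helper. A successful decode does not prove the guess
    was right, only that the bytes were valid in that encoding.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so it never fails,
    # but the resulting text may be wrong.
    return raw.decode("latin-1"), "latin-1"


print(decode_with_fallback("café".encode("utf-8")))
print(decode_with_fallback("café".encode("cp1252")))
```

Because many byte sequences are valid in several encodings, a guess like this can silently produce the wrong characters, which is why the encoding declared by the server or the document is preferable when available.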

2 answers

Answer 1: score 7, accepted, 2011-11-12 04:07:28

Answer 2: score 2, 2011-11-12 09:42:26
