
Why does b' (and sometimes b'') show up when I split some HTML source? [Python]

I'm fairly new to Python and programming in general. I have done a few tutorials and am about 2/3 of the way through a pretty good book. That said, I've been trying to get more comfortable with Python and programming by just trying things out in the standard library.

I have recently run into a weird quirk that I'm sure is the result of my own incorrect or un-"pythonic" use of the urllib module (with Python 3.2.2):

import urllib.request

HTML_source = urllib.request.urlopen('http://www.somelink.com').read()

print(HTML_source)

When this is run in the interactive interpreter it prints the HTML source of somelink, but prefixed with b', for example:

b'<HTML>\r\n<HEAD> (etc). . . .

If I split the string into a list on whitespace, every item is prefixed with b'.

I'm not really trying to accomplish anything specific, just trying to familiarize myself with the standard library. I would like to know why this b' is getting prefixed.

Also, a bonus question: is there a better way to get HTML source WITHOUT using a third-party module? I know all that jazz about not reinventing the wheel and whatnot, but I'm trying to learn by "building my own tools".

Thanks in advance!

The "b" prefix means that the type is bytes, not str. To convert the bytes into text, use the decode method and name the appropriate encoding. The encoding is often found in the "Content-Type" header:

>>> import urllib.request
>>> u = urllib.request.urlopen('http://cnn.com')
>>> u.getheader('Content-Type')
'text/html; charset=UTF-8'
>>> html = u.read().decode('utf-8')
>>> type(html)
<class 'str'>

If you don't find the encoding in the headers, try utf-8 as a default.
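The header lookup and the utf-8 fallback can be combined in a small helper. This is only a sketch: the names charset_from_content_type and decode_body are made up for illustration, and it uses the standard library's email.message.Message to parse the header value rather than splitting the string by hand. It works on a header string and raw bytes, so it can be tried without a network connection.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default`.

    Hypothetical helper: `content_type` is the string you would get from
    u.getheader('Content-Type'), and may be None if the header is missing.
    """
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the charset parameter and returns
    # the fallback when the header carries no charset.
    return msg.get_content_charset(default)


def decode_body(raw, content_type, default="utf-8"):
    """Decode response bytes using the header's charset, else `default`."""
    charset = charset_from_content_type(content_type, default)
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The server named an encoding Python does not know about.
        return raw.decode(default, errors="replace")


html = decode_body(b"<HTML>\r\n<HEAD>", "text/html; charset=UTF-8")
print(type(html), repr(html))
```

With a live response object u, the call would look like decode_body(u.read(), u.getheader('Content-Type')).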

b'' is a bytes literal. There are no b'' objects in memory, only bytes objects; b'' is just the notation for bytes objects in your source code. Plain quotes '' in source code create str objects (Unicode strings).
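A short illustration of why every item in the split list shows the prefix: splitting a bytes object gives a list of bytes objects, and print shows the repr of each list item. The sample data below is invented to resemble the output in the question.

```python
# Invented sample standing in for the bytes returned by .read()
raw = b"<HTML>\r\n<HEAD> <TITLE>hi</TITLE>"

parts = raw.split()          # splitting bytes yields a list of bytes
print(parts)                 # every item is shown with the b'' notation

text = raw.decode("utf-8")   # decode once, as early as possible
words = text.split()         # splitting str yields a list of str
print(words)                 # no b prefix any more
```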

If a bytes object represents text (not binary data such as an image), then in general you should decode it to a Unicode string as soon as possible. You need to know the character encoding of the text to do that.

HTML parsers such as lxml.html and BeautifulSoup may convert bytes to Unicode without your intervention.

If you don't know the encoding, detecting it may be non-trivial; see, e.g., how feedparser detects character encoding [2006].
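As a rough idea of what a fallback can look like when nothing declares the encoding, here is a deliberately naive sketch. It is not feedparser's algorithm: it just tries a list of candidate encodings in order and keeps the first one that decodes without error. The function name guess_decode and the candidate list are assumptions for illustration, and a guess of this kind can silently pick the wrong encoding.

```python
def guess_decode(raw, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding strictly; return (text, encoding_used).

    Naive heuristic for illustration only. The last resort is latin-1,
    which maps every byte to a character and therefore never fails,
    even when the result is not what the author intended.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1"), "latin-1"


text, used = guess_decode(b"caf\xc3\xa9")   # valid UTF-8
print(used, text)

text2, used2 = guess_decode(b"caf\xe9")     # not valid UTF-8
print(used2, text2)
```

Strict UTF-8 decoding is tried first because arbitrary non-UTF-8 text rarely happens to be valid UTF-8, so a successful strict decode is a fairly strong hint.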
