
Python Beautiful Soup .contents Property

What does BeautifulSoup's .contents do? I am working through crummy.com's tutorial and I don't really understand what .contents does. I have looked at the forums and have not seen any answers. Consider the code below:

from BeautifulSoup import BeautifulSoup
import re



doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
        '</html>']

soup = BeautifulSoup(''.join(doc))
print soup.contents[0].contents[0].contents[0].contents[0].name

I would expect the last line of the code to print 'body', but instead I get:

  File "pe_ratio.py", line 29, in <module>
    print soup.contents[0].contents[0].contents[0].contents[0].name
  File "C:\Python27\lib\BeautifulSoup.py", line 473, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
AttributeError: 'NavigableString' object has no attribute 'name'

Is .contents only concerned with html, head and title? If so, why is that?

Thanks in advance for the help.

It just gives you what's inside the tag. Let me demonstrate with an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html_doc, "html.parser")
head = soup.head

print(head.contents)

The above code gives me a list, [<title>The Dormouse's story</title>], because that is what is inside the head tag. Indexing it with [0] gives you the first item in the list.
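To make this concrete, here is a small sketch showing that .contents is an ordinary list of a tag's direct children, and that those children can be either tags or plain strings. It assumes bs4 is installed and uses the built-in "html.parser"; the snippet variable is made up for illustration and is not from the tutorial.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup, used only for illustration.
snippet = '<p class="story">Once upon a time <b>three</b> sisters</p>'
soup = BeautifulSoup(snippet, "html.parser")

p = soup.p
children = p.contents  # direct children only, in document order

# Report whether each child is a tag or a piece of text.
for child in children:
    kind = "tag" if isinstance(child, Tag) else "string"
    print(kind, repr(str(child)))

first = children[0]  # the leading text, not a tag
```

The point of the sketch is that [0] simply picks the first child, whatever its type; here that is the text before the b tag.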

The reason you get an error is that soup.contents[0].contents[0].contents[0].contents[0] returns a NavigableString (a piece of text), not a tag, and in the version of BeautifulSoup you are using a NavigableString has no name attribute. With your document it returns the text Page title: the first contents[0] gives you the html tag, the second gives you the head tag, the third gives you the title tag, and the fourth gives you the text inside the title. So when you ask for .name on it, there is no tag name to give you.
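The chain described above can be traced step by step. The following is only a sketch: it uses bs4 with "html.parser" rather than the BeautifulSoup 3 import from the question, and the path list is a name introduced here for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc), "html.parser")

node = soup
path = []
# Keep following contents[0] until we land on something that is not a tag.
while isinstance(node, Tag) and node.contents:
    node = node.contents[0]
    label = node.name if isinstance(node, Tag) else repr(str(node))
    path.append((type(node).__name__, label))

for depth, (type_name, label) in enumerate(path, start=1):
    print(depth, type_name, label)
```

The first three steps are tags (html, head, title) and the fourth is the NavigableString holding the title text, which is where the original chain ends up.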

If you want the body printed, you can do the following:

soup = BeautifulSoup(''.join(doc))
print(soup.body)

If you want to reach body using contents only, use the following:

soup = BeautifulSoup(''.join(doc))
print(soup.contents[0].contents[1].name)

You will not get it using [0] as the last index, because body is the second child of html, after head.
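One caveat with positional indexing: if the markup has whitespace between tags, those newlines appear in .contents as string children and shift the indexes. Below is a minimal sketch of one way around that, assuming bs4 and "html.parser". Both child_tags and the pretty markup are hypothetical names introduced here, not part of Beautiful Soup or of the question.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def child_tags(tag):
    """Return only the direct children that are tags, skipping strings."""
    return [child for child in tag.contents if isinstance(child, Tag)]

# Hypothetical pretty-printed markup: note the newlines between tags.
pretty = """<html>
<head><title>Page title</title></head>
<body><p>Hello</p></body>
</html>"""
soup = BeautifulSoup(pretty, "html.parser")

html = soup.html
# Raw indexing now counts the newline strings as children too.
print([type(child).__name__ for child in html.contents])

# Filtering to tags restores the "second element is body" reasoning.
second_tag = child_tags(html)[1]
print(second_tag.name)
```

In practice soup.body is the simpler route; filtering like this is mainly useful when you need to walk children by position. Beautiful Soup's own find_all(True, recursive=False) should give a similar list of direct child tags.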

