Python Beautiful Soup .contents attribute

Question

What does BeautifulSoup's .contents do? I am working through crummy.com's tutorial and I don't really understand what .contents does. I have looked at the forums and have not seen any answers. Consider the code below:

from BeautifulSoup import BeautifulSoup
import re



doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
        '</html>']

soup = BeautifulSoup(''.join(doc))
print soup.contents[0].contents[0].contents[0].contents[0].name

I would expect the last line of the code to print 'body', but instead I get:

  File "pe_ratio.py", line 29, in <module>
    print soup.contents[0].contents[0].contents[0].contents[0].name
  File "C:\Python27\lib\BeautifulSoup.py", line 473, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
AttributeError: 'NavigableString' object has no attribute 'name'

Is .contents only concerned with html, head and title? If so, why is that?

Thanks in advance for the help.

Answer 1

It just gives you what's inside the tag. Let me demonstrate with an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
head = soup.head

print head.contents

The above code gives me a list, [<title>The Dormouse's story</title>], because that is what's inside the head tag. So indexing with [0] gives you the first item in the list.
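To make the point about head.contents concrete, here is a minimal sketch. It assumes Python 3 with the bs4 package and the built-in "html.parser" backend (the original answer does not name a parser), and it uses a shortened version of the document above:

```python
from bs4 import BeautifulSoup

# Shortened version of the example document (an assumption for brevity).
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class=\"title\"><b>The Dormouse's story</b></p></body></html>"
)

# Naming the parser explicitly is an assumption; the answer relies on the default.
soup = BeautifulSoup(html_doc, "html.parser")
head = soup.head

# .contents is a plain list of the tag's direct children.
print(head.contents)

# Indexing the list returns the first child, here the <title> tag.
first_child = head.contents[0]
print(first_child.name)
```

Since .contents is an ordinary list, len(), slicing and iteration all work on it as usual.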

The reason you get an error is that soup.contents[0].contents[0].contents[0].contents[0] returns a NavigableString (a plain text node), not a tag, so it has no name attribute. In your code it returns Page title: the first contents[0] gives you the html tag, the second gives you the head tag, the third leads to the title tag, and the fourth gives you the text inside the title. So when you ask for .name on it, there is no tag name to give you.
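The chain described above can be traced one step at a time. The following is a sketch only, assuming Python 3, bs4 and the "html.parser" backend rather than the BeautifulSoup 3 / Python 2.7 setup from the question. The steps list is a hypothetical helper introduced just for illustration:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Same markup as in the question, joined into one string.
doc = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.'
    '</html>'
)

soup = BeautifulSoup(doc, "html.parser")

# Follow contents[0] four times and record what kind of node each step yields.
steps = []
node = soup
for _ in range(4):
    node = node.contents[0]
    if isinstance(node, Tag):
        steps.append((type(node).__name__, node.name))
    else:
        # Text nodes are NavigableString objects; record their text instead of a name.
        steps.append((type(node).__name__, repr(str(node))))

for kind, label in steps:
    print(kind, label)
```

Checking isinstance(node, Tag) before asking for a tag name avoids depending on how a particular Beautiful Soup version treats .name on a text node.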

If you want the body printed, you can do the following:

soup = BeautifulSoup(''.join(doc))
print soup.body

If you want to reach body using only contents, use the following:

soup = BeautifulSoup(''.join(doc))
print soup.contents[0].contents[1].name

You will not get it using [0] as the index, because body is the second child of html, after head.
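A hard-coded index such as [1] depends on the exact layout of the document. As a hedged alternative, the sketch below lists the child tags of html and picks body by name. It assumes Python 3, bs4 and the "html.parser" backend; the variable names are illustrative only:

```python
from bs4 import BeautifulSoup, Tag

# Same markup as in the question, joined into one string.
doc = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.'
    '</html>'
)

soup = BeautifulSoup(doc, "html.parser")
html_tag = soup.contents[0]

# Names of the direct child tags of <html>, skipping any text nodes.
child_names = [child.name for child in html_tag.contents if isinstance(child, Tag)]
print(child_names)

# Select <body> by name instead of relying on its position in the list.
body = next(
    child for child in html_tag.contents
    if isinstance(child, Tag) and child.name == "body"
)
print(body.name)
```

In this document the result is the same object as html_tag.contents[1] and soup.body; selecting by name just keeps working if whitespace or extra nodes shift the positions.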

Answer 1: 3, accepted, 2013-10-26 03:46:06