Wrap the contents of a tag with BeautifulSoup

I'm trying to wrap the contents of a tag with BeautifulSoup. This:

<div class="footnotes">
    <p>Footnote 1</p>
    <p>Footnote 2</p>
</div>

should become this:

<div class="footnotes">
  <ol>
    <p>Footnote 1</p>
    <p>Footnote 2</p>
  </ol>
</div>

So I use the following code:

footnotes = soup.findAll("div", { "class" : "footnotes" })
footnotes_contents = ''
new_ol = soup.new_tag("ol") 
for content in footnotes[0].children:
    new_tag = soup.new_tag(content)
    new_ol.append(new_tag)

footnotes[0].clear()
footnotes[0].append(new_ol)

print(footnotes[0])

but I get the following:

<div class="footnotes"><ol><
    ></
    ><<p>Footnote 1</p>></<p>Footnote 1</p>><
    ></
    ><<p>Footnote 2</p>></<p>Footnote 2</p>><
></
></ol></div>

Suggestions?

Just move the .contents of your tag over using tag.extract(); don't try to create them anew with soup.new_tag (which only takes a tag name, not a whole tag object). Don't call .clear() on the original tag; .extract() already removed the elements.

Move items over in reverse, because the contents are being modified in place and elements get skipped if you don't watch out.

Finally, use .find() when you only need to do this for one tag.

Because the .contents list is modified in place as you extract elements, don't loop over it directly in forward order; iterate in reverse, as here, or over a copy of the list:

footnotes = soup.find("div", { "class" : "footnotes" })
new_ol = soup.new_tag("ol")

for content in reversed(footnotes.contents):
    new_ol.insert(0, content.extract())

footnotes.append(new_ol)
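
Why the order of iteration matters can be seen without BeautifulSoup at all. The following is only an analogy, assuming .contents behaves like an ordinary Python list and that .extract() removes the element from that list; the strings and the variable names are stand-ins, not part of the answer's code.

```python
# Plain strings stand in for the child elements of a tag.

# Forward iteration while removing: every second item is skipped.
source = ["p1", "p2", "p3", "p4"]
moved_forward = []
for item in source:
    source.remove(item)          # mimics item.extract(): shrinks the list in place
    moved_forward.append(item)

# Reverse iteration while removing: nothing is skipped, and insert(0, ...)
# restores the original order in the destination.
source_rev = ["p1", "p2", "p3", "p4"]
moved_reverse = []
for item in reversed(source_rev):
    source_rev.remove(item)
    moved_reverse.insert(0, item)

# Iterating over a copy: the loop no longer depends on the list being mutated.
source_copy = ["p1", "p2", "p3", "p4"]
moved_copy = []
for item in list(source_copy):
    source_copy.remove(item)
    moved_copy.append(item)

print(moved_forward)   # only part of the items were moved
print(moved_reverse)
print(moved_copy)
```

In the forward loop the list shrinks under the iterator, so the item that slides into the current position is never visited.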

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <div class="footnotes">
...     <p>Footnote 1</p>
...     <p>Footnote 2</p>
... </div>
... ''')
>>> footnotes = soup.find("div", { "class" : "footnotes" })
>>> new_ol = soup.new_tag("ol")
>>> for content in reversed(footnotes.contents):
...     new_ol.insert(0, content.extract())
... 
>>> footnotes.append(new_ol)
>>> print(footnotes)
<div class="footnotes"><ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol></div>
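
For reference, the same idea can be written for Python 3 by iterating over a copy of .contents instead of in reverse. This is a minimal sketch rather than the answer's own code: the helper name wrap_contents is hypothetical, the parser is assumed to be the built-in "html.parser", and the input is written without whitespace between tags to keep the output easy to read.

```python
from bs4 import BeautifulSoup


def wrap_contents(soup, tag, wrapper_name):
    """Move every child of `tag` into a new <wrapper_name> element inside `tag`.

    Hypothetical helper for illustration only.
    """
    wrapper = soup.new_tag(wrapper_name)   # new_tag takes a tag *name*
    # list(...) takes a snapshot, so extract() can shrink tag.contents safely.
    for child in list(tag.contents):
        wrapper.append(child.extract())
    tag.append(wrapper)
    return wrapper


html = '<div class="footnotes"><p>Footnote 1</p><p>Footnote 2</p></div>'
soup = BeautifulSoup(html, "html.parser")
footnotes = soup.find("div", class_="footnotes")
wrap_contents(soup, footnotes, "ol")
print(footnotes)
```

Because the loop walks a snapshot in document order, a plain append() keeps the children in their original sequence and no insert(0, ...) is needed.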

Using lxml:

import lxml.html as LH
import lxml.builder as builder
E = builder.E

doc = LH.parse('data')
footnote = doc.find('//div[@class="footnotes"]')
ol = E.ol()
for tag in footnote:
    ol.append(tag)
footnote.append(ol)
print(LH.tostring(doc.getroot()))

prints

<html><body><div class="footnotes">
    <ol><p>Footnote 1</p>
    <p>Footnote 2</p>
</ol></div></body></html>

Note that with lxml, an Element (tag) can be in only one place in the tree (since every Element has only one parent), so appending tag to ol also removes it from footnote. So unlike with BeautifulSoup, you do not need to iterate over the contents in reverse order, nor use insert(0, ...). You just append in order.


Using BeautifulSoup:

import bs4 as bs
with open('data', 'r') as f:
    soup = bs.BeautifulSoup(f)

footnote = soup.find("div", { "class" : "footnotes" })
new_ol = soup.new_tag("ol")

for content in reversed(footnote.contents):
    new_ol.insert(0, content.extract())

footnote.append(new_ol)
print(soup)

prints

<html><body><div class="footnotes"><ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol></div></body></html>
