Beautiful Soup: searching for a nested pattern?

soup.find_all will search a BeautifulSoup document for all occurrences of a single tag. Is there a way to search for particular patterns of nested tags?

For example, I would like to search for all occurrences of this pattern:

<div class="separator">
  <a>
    <img />
  </a>
</div>

Check out this part of the docs. You probably want a function like this:

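def nested_img(tag):
    # Match <div class="separator"> with a direct <a> child that has a direct <img> child.
    if tag.name != "div" or "separator" not in tag.get("class", []):
        return False
    a = tag.find("a", recursive=False)  # skips whitespace text nodes, unlike contents[0]
    return a is not None and a.find("img", recursive=False) is not None

soup.find_all(nested_img)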

PS: This is untested.
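
One thing to watch for when a filter function inspects children by position: in pretty-printed markup the parser keeps the whitespace between tags as text nodes, so the first entry of .contents is often a string rather than a tag. A minimal sketch, assuming the html.parser backend and a sample document made up for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Illustrative markup (not from the question's real page).
html = """<div class="separator">
  <a>
    <img src="test1"/>
  </a>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# The first child is the newline and indentation after the opening <div>.
first = div.contents[0]
first_is_text = isinstance(first, NavigableString)
print(first_is_text)

# Keeping only Tag children gives the element structure.
tag_children = [c.name for c in div.contents if isinstance(c, Tag)]
print(tag_children)
```

Filtering on isinstance(child, Tag), or using find(..., recursive=False), avoids depending on how the source HTML happens to be indented.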

There are multiple ways to find the pattern, but the easiest one would be to use a CSS selector:

for img in soup.select('div.separator > a > img'):
    print(img)  # or img.parent.parent to get the "div"

Demo:

>>> from bs4 import BeautifulSoup
>>> data = """
... <div>
...     <div class="separator">
...       <a>
...         <img src="test1"/>
...       </a>
...     </div>
... 
...     <div class="separator">
...       <a>
...         <img src="test2"/>
...       </a>
...     </div>
... 
...     <div>test3</div>
... 
...     <div>
...         <a>test4</a>
...     </div>
... </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> 
>>> for img in soup.select('div.separator > a > img'):
...     print(img.get('src'))
... 
test1
test2

I do understand that, strictly speaking, this selector does not enforce the exact pattern: it also matches when the div has more than one a child, or when the a tag contains something other than the img tag. If that matters, the solution can be improved with additional checks (I will edit the answer if needed).
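
The additional checks are not spelled out in the answer. One possible sketch, assuming "exact pattern" means each element has exactly one child tag and no non-whitespace text; the helper names only_child_tag and strict_matches are hypothetical, not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def only_child_tag(parent, name):
    """Return the sole child tag of `parent` if it is named `name` and
    `parent` holds no other tags and no non-whitespace strings; else None."""
    tags = [c for c in parent.contents if isinstance(c, Tag)]
    has_text = any(
        not isinstance(c, Tag) and str(c).strip() for c in parent.contents
    )
    if len(tags) == 1 and tags[0].name == name and not has_text:
        return tags[0]
    return None


def strict_matches(soup):
    """Collect div.separator elements shaped exactly like div > a > img."""
    result = []
    for div in soup.select("div.separator"):
        a = only_child_tag(div, "a")
        if a is not None and only_child_tag(a, "img") is not None:
            result.append(div)
    return result


# Illustrative markup: only the first and last divs fit the strict pattern.
html = """
<div class="separator"><a><img src="test1"/></a></div>
<div class="separator"><a><img src="test2"/></a><a>extra</a></div>
<div class="separator"><a>text <img src="test3"/></a></div>
<div class="separator">
  <a>
    <img src="test4"/>
  </a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
srcs = [div.a.img["src"] for div in strict_matches(soup)]
print(srcs)  # ['test1', 'test4']
```

Whitespace-only strings are ignored so that indentation in the source does not cause a mismatch; HTML comments count as content in this sketch, which may or may not be what you want.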
