Searching for pieces of an attribute with Beautiful Soup

I want to use Beautiful Soup to pull out every element with the following format:

div class="dog-a b-cat"

If I know what "a" and "b" are, I can get a particular instance with the following (suppose a=aardvark and b=boy):

soup.find_all("div",class_="dog-aardvark boy-cat")

Is there any way to pull out all instances, regardless of the two words between the dashes, that have dog and cat with two dashes in between?

@bourbaki4481472 is on the right track in general, but the proposed solution would not work for several reasons. To start with, the specified regular expression would be matched against a single class at a time, since class is a special multi-valued attribute. On top of that, the code as posted was not syntactically valid.

I suggest writing a filtering function that checks that the first class value starts with dog- and the second one ends with -cat. If needed, you can improve it by additionally checking the tag name or how many class values are present:

def class_filter(elm):
    try:
        classes = elm["class"]
        return classes[0].startswith("dog-") and classes[1].endswith("-cat")
    except (KeyError, IndexError, TypeError):
        return False
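The stricter checks mentioned above (tag name and number of class values) could look roughly like the sketch below. The helper name is_dog_cat_div is hypothetical, and it takes the tag name and the list of class values as plain arguments so that it can be tried without parsing any HTML.

```python
def is_dog_cat_div(name, classes):
    """Hypothetical stricter check: a div with exactly two class values."""
    # `name` is the tag name; `classes` is the list of class values, which is
    # how Beautiful Soup exposes the multi-valued class attribute for HTML.
    if name != "div" or len(classes) != 2:
        return False
    first, second = classes
    return first.startswith("dog-") and second.endswith("-cat")


print(is_dog_cat_div("div", ["dog-aardvark", "boy-cat"]))            # True
print(is_dog_cat_div("span", ["dog-aardvark", "boy-cat"]))           # False
print(is_dog_cat_div("div", ["dog-aardvark", "boy-cat", "extra"]))   # False
```

Assuming the document was parsed as HTML, it could be plugged into a search as soup.find_all(lambda elm: is_dog_cat_div(elm.name, elm.get("class") or [])).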

Complete example:

from bs4 import BeautifulSoup

data = """
<div class="dog-test test-cat">test1</div>
<div class="dog-test">test2</div>
<div class="test-cat">test3</div>
<div class="dog">test4</div>
<div class="cat">test5</div>
<div class="irrelevant">test6</div>
"""

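# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")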

def class_filter(elm):
    try:
        classes = elm["class"]
        return classes[0].startswith("dog-") and classes[1].endswith("-cat")
    except (KeyError, IndexError, TypeError):
        return False

for elm in soup.find_all(class_filter):
    print(elm.text)

Prints test1 only.

Try using regular expressions to generalize your parameters.

import re

# class is a reserved word in Python, so Beautiful Soup takes the keyword class_
soup.find_all("div", class_=re.compile(r"dog-.+ boy-.+"))

The above looks for the string dog- followed by one or more characters, then a space, then boy- followed by one or more characters.
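To see what a pattern of this shape accepts, one option is to apply it to the class values joined by a single space. The sketch below is only an illustration under that assumption: the names DOG_CAT and joined_class_matches are hypothetical, and the pattern is adapted to the dog-a b-cat format from the question rather than the boy- prefix used above.

```python
import re

# dog-<something>, one space, <something>-cat.
# \S+ cannot cross a space, so each half is confined to one class value.
DOG_CAT = re.compile(r"dog-\S+ \S+-cat")


def joined_class_matches(classes):
    """Return True if the space-joined class list has the form 'dog-a b-cat'."""
    return DOG_CAT.fullmatch(" ".join(classes)) is not None


print(joined_class_matches(["dog-aardvark", "boy-cat"]))  # True
print(joined_class_matches(["dog-test"]))                 # False
print(joined_class_matches(["dog-a", "x", "b-cat"]))      # False
```

Using fullmatch rather than search keeps elements with extra class values from matching by accident.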
