How do I access regex match capture groups when using BeautifulSoup 4's `find_all` with a regular expression?

Question

I'm using BeautifulSoup 4, and I'm calling find_all with a regular expression to find all the links that match a particular pattern:

results = page.find_all(href=re.compile(r"foo/bar\?baz="))
for result in results:
    ...

However, I also want to extract a parameter from the URL.

I can mark the parameter for extraction by putting a capture group around it:

results = page.find_all(href=re.compile(r"foo/bar\?baz=([^&]+)"))

But if I do this, how do I access the value of the capture group for a particular match?

Answer 1

Yes, you can. Write a helper class with the magic methods __call__() and __iter__(), and pass an instance of it to BeautifulSoup's find_all() in place of a function. The instance records the captured groups for every value it accepts, and I used zip() to tie those groups to the matched elements:

from bs4 import BeautifulSoup, Tag
import re

data = '''<div>
<a href="link_1">Link 1</a>
<a href="link_2">Link 1</a>
<a href="link_XXX">Link 1</a>
<a href="link_3">Link 1</a>
</div>'''

soup = BeautifulSoup(data, 'lxml')

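class my_regex_searcher:
    """Callable filter for find_all() that records the capture groups of every match."""

    def __init__(self, regex_string):
        self.__r = re.compile(regex_string)
        self.groups = []  # one findall() result per accepted value, in call order

    def __call__(self, what):
        # When used as a name filter the argument is a Tag, so match on its name;
        # when passed as href=... the argument is the attribute value itself.
        if isinstance(what, Tag):
            what = what.name

        # Skip empty or missing values, then record the groups of any match.
        if what:
            g = self.__r.findall(what)
            if g:
                self.groups.append(g)
                return True
        return False

    def __iter__(self):
        # Yield the recorded groups in the order the matches were accepted.
        yield from self.groups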

searcher = my_regex_searcher(r'link_(\d+)')
for l, groups in zip(soup.find_all(href=searcher), searcher):
    print(l)
    print(groups)

searcher = my_regex_searcher(r'(d)(i)(v)')
for l, groups in zip(soup.find_all(searcher), searcher):
    print(l.prettify())
    print(groups)

Prints:

<a href="link_1">Link 1</a>
['1']
<a href="link_2">Link 1</a>
['2']
<a href="link_3">Link 1</a>
['3']
<div>
 <a href="link_1">
  Link 1
 </a>
 <a href="link_2">
  Link 1
 </a>
 <a href="link_XXX">
  Link 1
 </a>
 <a href="link_3">
  Link 1
 </a>
</div>
[('d', 'i', 'v')]
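
If only the captured value from each matching link is needed, a simpler alternative (not part of the answer above) is to keep a reference to the compiled pattern and run it again on each result's href. The following is a minimal sketch under that assumption; the sample HTML, the variable names, and the use of the built-in 'html.parser' backend are all illustrative choices.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the first and third links match the pattern.
html = '''<div>
<a href="foo/bar?baz=alpha&amp;x=1">A</a>
<a href="foo/other?baz=beta">B</a>
<a href="foo/bar?baz=gamma">C</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Compile once so the same pattern can filter the tags and extract the group.
pattern = re.compile(r'foo/bar\?baz=([^&]+)')

values = []
for tag in soup.find_all(href=pattern):
    # Re-run the pattern on the attribute value to get at the match object.
    match = pattern.search(tag['href'])
    if match:
        values.append(match.group(1))

print(values)
```

This runs the regex twice per matching link, but it keeps each captured value next to the tag it came from instead of relying on two sequences lining up in zip().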
