When using BeautifulSoup 4's `find_all` with a regex, how do I access regex match capture groups?
I'm using BeautifulSoup 4, and I'm using find_all with a regular expression to find all the links matching a particular pattern.
results = page.find_all(href=re.compile(r"foo/bar\?baz="))
for result in results:
    ...
However, I also want to extract a parameter from the URL.
I can mark the parameter for extraction by putting a capture group on it:
results = page.find_all(href=re.compile(r"foo/bar\?baz=([^&]+)"))
But if I do this, how do I access the value of the capture group in a particular match?
Yes, you can. Make a helper class with the magic methods __call__() and __iter__(), and supply an instance of this class as the filter function to BeautifulSoup's find_all(). Each call records the groups it matched, and I used zip() to tie those groups to the matched elements:
from bs4 import BeautifulSoup, Tag
import re

data = '''<div>
<a href="link_1">Link 1</a>
<a href="link_2">Link 1</a>
<a href="link_XXX">Link 1</a>
<a href="link_3">Link 1</a>
</div>'''

soup = BeautifulSoup(data, 'lxml')

class my_regex_searcher:
    def __init__(self, regex_string):
        self.__r = re.compile(regex_string)
        self.groups = []  # one entry per element that matched, in match order

    def __call__(self, what):
        # find_all() passes a Tag when this is used as a positional filter,
        # or the attribute value (a string, or None) when used as href=...
        if isinstance(what, Tag):
            what = what.name
        if what:
            g = self.__r.findall(what)
            if g:
                self.groups.append(g)
                return True
        return False

    def __iter__(self):
        yield from self.groups

# Filter on the href attribute and capture the number after "link_"
searcher = my_regex_searcher(r'link_(\d+)')
for l, groups in zip(soup.find_all(href=searcher), searcher):
    print(l)
    print(groups)

# Filter on the tag name instead, with three capture groups
searcher = my_regex_searcher(r'(d)(i)(v)')
for l, groups in zip(soup.find_all(searcher), searcher):
    print(l.prettify())
    print(groups)
Prints:
<a href="link_1">Link 1</a>
['1']
<a href="link_2">Link 1</a>
['2']
<a href="link_3">Link 1</a>
['3']
<div>
 <a href="link_1">
  Link 1
 </a>
 <a href="link_2">
  Link 1
 </a>
 <a href="link_XXX">
  Link 1
 </a>
 <a href="link_3">
  Link 1
 </a>
</div>
[('d', 'i', 'v')]
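A simpler alternative, sketched here under assumptions rather than taken from the answer above: keep a reference to the compiled pattern, let find_all() do the filtering, and then run the pattern's search() again on each matched tag's href to get a match object with the capture groups. The sample HTML, the variable names, and the choice of the built-in html.parser are illustrative assumptions, not part of the original question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup shaped like the URLs in the question
html = '''<div>
<a href="foo/bar?baz=alpha&amp;x=1">A</a>
<a href="foo/bar?other=1">B</a>
<a href="foo/bar?baz=beta">C</a>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# Compile once so the same pattern can be reused after filtering
pattern = re.compile(r"foo/bar\?baz=([^&]+)")

values = []
for tag in soup.find_all(href=pattern):
    # find_all() only returns the tags; re-run the pattern to get the groups
    match = pattern.search(tag["href"])
    if match:
        values.append(match.group(1))

print(values)
```

This re-runs the regex once per matched tag, which is usually cheap, and it avoids keeping state in the filter object, so the same pattern can be reused across several find_all() calls without stale groups piling up.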