Ignore children nodes when searching for regular expression
I want to identify a split point in a large text document with BeautifulSoup. To do that, I have written a regular expression to find the Tag in which a specific string occurs. The problem is that it does not work if there is further formatting (children nodes) within the string I am searching for.
import re
from bs4 import BeautifulSoup

t1 = BeautifulSoup('<p class="p p8"><strong>Question-And-Answer</strong></p>', "html.parser")
t2 = BeautifulSoup('<p class="p p8"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>', "html.parser")

pattern = re.compile("Question[s]*-And-Answer[s]*", re.IGNORECASE)
t1.find(text=pattern)  # 'Question-And-Answer'
t2.find(text=pattern)  # None
The output should be the p Tag object.
The problem here is that the text you are looking for is split across strong tags inside the p node, so the regex search using the text argument of .find won't work. That is simply how it is implemented in BS: the pattern is tested against each text node on its own ("Question", "-", "And", and so on), never against the combined text of the p tag.
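To make the failure concrete, here is a minimal sketch, assuming the html.parser backend and the markup from the question. It lists the separate text nodes that the text search sees and compares them with the combined text of the p tag; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = ('<p class="p p8"><strong>Question</strong>-'
        '<strong>And</strong>-<strong>Answer</strong></p>')
soup = BeautifulSoup(html, "html.parser")

# Every text node is a separate string, so none of them
# contains the whole phrase the regex is looking for.
fragments = list(soup.strings)
print(fragments)

# The combined text of the p tag joins the fragments back together.
combined = soup.p.get_text()
print(combined)
```

None of the individual fragments matches the full pattern, which is why the search returns None, while the combined text does contain the phrase.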
If you know that the texts are in p nodes, you can pass a lambda expression to the .find call and run a regex search against the text property of each p tag to find the elements you need:
print(t2.find(lambda t: t.name == "p" and re.search(r'Questions*-And-Answers*', t.text)))
# => <p class="p p8"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>
Note that [s] is the same as s in a regex: a character class containing a single character matches exactly that character, so [s]* can be written as s*.
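If the same lookup is needed in several places, the lambda can be wrapped in a small helper. The following is only a sketch: find_tag_by_text is a hypothetical name, not part of BeautifulSoup, and it assumes the html.parser backend and case-insensitive matching as in the question.

```python
import re
from bs4 import BeautifulSoup


def find_tag_by_text(soup, pattern, name="p"):
    """Return the first tag called `name` whose combined text matches `pattern`.

    Hypothetical helper for illustration; it returns None when nothing matches.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    return soup.find(
        lambda tag: tag.name == name and regex.search(tag.get_text()) is not None
    )


t1 = BeautifulSoup(
    '<p class="p p8"><strong>Question-And-Answer</strong></p>', "html.parser"
)
t2 = BeautifulSoup(
    '<p class="p p8"><strong>Question</strong>-<strong>And</strong>-'
    '<strong>Answer</strong></p>',
    "html.parser",
)

PATTERN = r"Questions*-And-Answers*"
hit1 = find_tag_by_text(t1, PATTERN)
hit2 = find_tag_by_text(t2, PATTERN)
print(hit1.name, hit2.name)
```

Both documents now yield the p Tag object, whether or not the phrase is broken up by child tags.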