Ignore children nodes when searching for regular expression

I want to identify a split point in a large text document with BeautifulSoup. To do that, I wrote a regular expression to find the Tag in which a specific string occurs. The problem is that it does not work when the string I am searching for contains further formatting, i.e. children nodes.

import re
from bs4 import BeautifulSoup

# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
t1 = BeautifulSoup('<p class="p p8"><strong>Question-And-Answer</strong></p>', "html.parser")
t2 = BeautifulSoup('<p class="p p8"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>', "html.parser")

t1.find(text=re.compile("Question[s]*-And-Answer[s]*", re.IGNORECASE))
# => 'Question-And-Answer'

t2.find(text=re.compile("Question[s]*-And-Answer[s]*", re.IGNORECASE))
# => None

The output should be the p Tag object.

The problem here is that the text you are looking for is split across several strong tags inside the p node, so a regex search via the text argument of .find cannot match it. BeautifulSoup tests that argument against each individual string in the tree, never against the combined text of a tag.
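To make the failure concrete, the following minimal sketch lists the separate strings BeautifulSoup sees inside the second document and checks the pattern against each one. The variable names (fragments, matches_per_fragment, joined) are only illustrative, and "html.parser" is an assumed parser choice.

```python
import re
from bs4 import BeautifulSoup

html = '<p class="p p8"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>'
soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
pattern = re.compile(r"Questions*-And-Answers*", re.IGNORECASE)

# Each text node is a separate string; the text argument is tested against these one by one.
fragments = [str(s) for s in soup.strings]
print(fragments)

# None of the individual fragments contains the whole phrase.
matches_per_fragment = [bool(pattern.search(s)) for s in fragments]
print(matches_per_fragment)

# The combined text of the p tag does contain it.
joined = soup.p.get_text()
print(joined, bool(pattern.search(joined)))
```

No single fragment matches, while the combined text of the p tag does, which is why the search has to run against the tag's text rather than against individual strings.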

If you know that the text is inside p nodes, you can pass a lambda to .find and run the regex search against the text property of each p tag, which concatenates the strings of all its descendants:

print(t2.find(lambda t: t.name == "p" and re.search(r'Questions*-And-Answers*', t.text)))
# => <p class="p p8"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>

Note that [s] is the same as s in a regex: a character class with a single member matches just that character, so [s]* can be written as s*.
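If the same lookup is needed in several places, the lambda can be wrapped in a small helper. This is only a sketch: find_tag_by_text is a hypothetical name, not part of BeautifulSoup, and it restores the re.IGNORECASE flag from the question.

```python
import re
from bs4 import BeautifulSoup


def find_tag_by_text(soup, name, pattern):
    """Hypothetical helper: first `name` tag whose combined text matches `pattern`, else None."""
    return soup.find(
        lambda tag: tag.name == name and pattern.search(tag.get_text()) is not None
    )


pattern = re.compile(r"Questions*-And-Answers*", re.IGNORECASE)

t1 = BeautifulSoup('<p class="p p8"><strong>Question-And-Answer</strong></p>', "html.parser")
t2 = BeautifulSoup(
    '<p class="p p8"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>',
    "html.parser",
)

# Both documents yield the enclosing p tag, regardless of how the text is split up inside it.
for soup in (t1, t2):
    match = find_tag_by_text(soup, "p", pattern)
    print(match.name, match.get_text())
```

Because the match is made on the combined text, the helper returns the p Tag object for both inputs and None when no tag of the requested name matches.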
