Ignore child nodes when searching with a regular expression

Question

I want to identify a split point in a large text document with BeautifulSoup. To do that, I wrote a regular expression to find the Tag in which a specific string occurs. The problem is that it does not work if the string I am searching for contains further formatting (child nodes).

import re
from bs4 import BeautifulSoup

# The phrase sits in a single <strong> tag
t1 = BeautifulSoup('<p class="p p8"><strong>Question-And-Answer</strong></p>', "html.parser")

# The same phrase, split across several <strong> tags
t2 = BeautifulSoup('<p class="p p8"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>', "html.parser")

t1.find(text = re.compile("Question[s]*-And-Answer[s]*", re.IGNORECASE))
# => 'Question-And-Answer'

t2.find(text = re.compile("Question[s]*-And-Answer[s]*", re.IGNORECASE))
# => None

In both cases, the output should be the p Tag object.

Answer 1

The problem here is that the text you are looking for is split up by the strong tags inside the p node, so a regex search using the text argument of .find won't work. That is simply how it is implemented in BS: the pattern is tested against each text node separately, and no single text node contains the whole phrase.
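
To see why the match fails, it can help to look at the text nodes directly. The following is a minimal sketch, assuming the built-in "html.parser" backend and the second document from the question; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = ('<p class="p p8"><strong>Question</strong>-'
        '<strong>And</strong>-<strong>Answer</strong></p>')
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# Every fragment of text is its own text node; none holds the full phrase.
pieces = [str(s) for s in p.strings]
print(pieces)        # ['Question', '-', 'And', '-', 'Answer']

# .string is None because the p tag has more than one child.
print(p.string)      # None

# get_text() (also available as .text) joins all descendant text nodes.
print(p.get_text())  # Question-And-Answer
```

Only the joined text contains "Question-And-Answer", which is why the search has to run against the text of the whole tag rather than against individual text nodes.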

If you know that the text is in p nodes, you can pass a lambda expression to .find and run a regex search against the text property of each p tag to find the elements you need:

print(t2.find(lambda t: t.name == "p" and re.search(r'Questions*-And-Answers*', t.text)))
# => <p class="p p8"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>
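
If the same lookup is needed in several places, the lambda can be wrapped in a small helper. This is only a sketch: find_tag_by_text is a hypothetical name, not part of BeautifulSoup, and re.IGNORECASE is carried over from the question rather than from the one-liner above.

```python
import re
from bs4 import BeautifulSoup

PATTERN = re.compile(r"Questions*-And-Answers*", re.IGNORECASE)

def find_tag_by_text(soup, name, pattern):
    """Hypothetical helper: first tag called `name` whose joined text matches."""
    return soup.find(
        lambda tag: tag.name == name
        and pattern.search(tag.get_text()) is not None
    )

t1 = BeautifulSoup(
    '<p class="p p8"><strong>Question-And-Answer</strong></p>', "html.parser")
t2 = BeautifulSoup(
    '<p class="p p8"><strong>Question</strong>-<strong>And</strong>-'
    '<strong>Answer</strong></p>', "html.parser")

for soup in (t1, t2):
    tag = find_tag_by_text(soup, "p", PATTERN)
    print(tag.name, tag.get_text())  # p Question-And-Answer (both times)
```

The helper returns None when no tag matches, just like .find itself, so the caller still has to handle that case.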

Note that [s] is the same as s in a regex: a character class containing a single character matches exactly that character.
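
A quick way to convince yourself of this is to run both spellings of the pattern over a few strings. The sample strings below are made up for illustration.

```python
import re

bracketed = re.compile(r"Question[s]*-And-Answer[s]*")
plain = re.compile(r"Questions*-And-Answers*")

samples = [
    "Question-And-Answer",
    "Questions-And-Answers",
    "Questionss-And-Answer",
    "Question-Or-Answer",
]

# Pair up the verdicts of both patterns for every sample string.
results = [
    (bool(bracketed.fullmatch(s)), bool(plain.fullmatch(s))) for s in samples
]

for sample, (with_class, without_class) in zip(samples, results):
    print(sample, with_class, without_class)
```

Both patterns agree on every sample: the first three match and the last one does not.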

Solution 1
1 accepted 2019-02-07 12:47:50
