Ignore children nodes when searching for regular expression

Question

I want to identify a split point in a large text document with BeautifulSoup. To do so, I wrote a regular expression to find the tag in which a specific string occurs. The problem is that it does not match when the string I am searching for contains further formatting, i.e. child nodes.

import re
from bs4 import BeautifulSoup

t1 = BeautifulSoup('<p class="p p8"><strong>Question-And-Answer</strong></p>', "html.parser")

t2 = BeautifulSoup('<p class="p p8"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>', "html.parser")

t1.find(text=re.compile("Question[s]*-And-Answer[s]*", re.IGNORECASE))
# => 'Question-And-Answer'

t2.find(text=re.compile("Question[s]*-And-Answer[s]*", re.IGNORECASE))
# => None

The output should be the p Tag object.

Answer 1

The problem is that the text you are looking for is split across several strong tags inside the p node. The text argument of .find tests the pattern against each text node separately, and no single node contains the whole phrase, so the search returns None. That is simply how it is implemented in BeautifulSoup.
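To see why the search comes back empty, it can help to list the text nodes that BeautifulSoup actually tests. This is only a small sketch: it assumes the built-in "html.parser" backend, and the variable names (text_nodes, matches_per_node, full_text) are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = ('<p class="p p8"><strong>Question</strong>-'
        '<strong>And</strong>-<strong>Answer</strong></p>')
soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
pattern = re.compile(r"Questions*-And-Answers*", re.IGNORECASE)

# Every piece of text between tags is its own node in the tree.
text_nodes = [str(s) for s in soup.find_all(string=True)]

# The pattern is checked against each node on its own, never against the joined text.
matches_per_node = [bool(pattern.search(s)) for s in text_nodes]

# The concatenated text of the <p> tag does contain the whole phrase.
full_text = soup.p.get_text()

print(text_nodes)
print(matches_per_node)
print(bool(pattern.search(full_text)))
```

None of the individual nodes matches, while the joined text of the p tag does, which is why the search has to run against the tag's combined text.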

If you know that the text is inside p nodes, you can pass a lambda to .find and run a regex search against the text property of each p tag to find the elements you need:

print(t2.find(lambda t: t.name == "p" and re.search(r'Questions*-And-Answers*', t.text)))
# => <p class="p p8"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>
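If you need this in more than one place, the same idea could be wrapped in a small helper. The sketch below is one possible way to do it, not a BeautifulSoup feature: find_tag_by_text is a hypothetical name, the "html.parser" backend is an assumption, and the pattern keeps the re.IGNORECASE flag from the question.

```python
import re
from bs4 import BeautifulSoup

PATTERN = re.compile(r"Questions*-And-Answers*", re.IGNORECASE)


def find_tag_by_text(soup, tag_name, pattern):
    """Return the first `tag_name` element whose combined text matches `pattern`, or None.

    Hypothetical helper: it compares against the tag's concatenated text,
    so nested formatting tags do not break the match.
    """
    def matches(tag):
        return tag.name == tag_name and pattern.search(tag.get_text()) is not None

    return soup.find(matches)


t1 = BeautifulSoup('<p class="p p8"><strong>Question-And-Answer</strong></p>',
                   "html.parser")
t2 = BeautifulSoup('<p class="p p8"><strong>Question</strong>-'
                   '<strong>And</strong>-<strong>Answer</strong></p>',
                   "html.parser")

p1 = find_tag_by_text(t1, "p", PATTERN)
p2 = find_tag_by_text(t2, "p", PATTERN)
print(p1)
print(p2)
```

Both calls should return the p Tag object, whether or not the phrase is broken up by strong tags.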

Note that a character class with a single character, [s], matches the same thing as s in a regex, so [s]* can be written as s*.
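A quick way to convince yourself of that equivalence is to run both spellings over a few strings. The sample strings below are made up for illustration.

```python
import re

with_class = re.compile(r"Question[s]*-And-Answer[s]*")
without_class = re.compile(r"Questions*-And-Answers*")

samples = [
    "Question-And-Answer",
    "Questions-And-Answers",
    "Questionss-And-Answer",  # the * quantifier also allows a repeated "s"
    "Question-Answer",        # missing "-And", should not match
]

# Pair up the result of each pattern for every sample string.
results = [
    (bool(with_class.fullmatch(s)), bool(without_class.fullmatch(s)))
    for s in samples
]

for sample, (a, b) in zip(samples, results):
    print(sample, a, b)
```

Both patterns agree on every sample. If only an optional single "s" is wanted, s? would be the stricter choice, since s* also accepts "Questionss".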

solution1
1 ACCEPTED 2019-02-07 12:47:50
