Ignore child nodes when searching with a regular expression

I want to identify a split point in a large text document with BeautifulSoup. To do that, I wrote a regular expression to find the Tag in which a specific string occurs. The problem is that it does not match when the string I am searching for contains further formatting, that is, child nodes.

import re
from bs4 import BeautifulSoup

# The whole phrase sits in a single <strong> element.
t1 = BeautifulSoup("<p class=\"p p8\"><strong>Question-And-Answer</strong></p>", "html.parser")

# The same phrase, but split across several <strong> elements.
t2 = BeautifulSoup("<p class=\"p p8\"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>", "html.parser")

t1.find(text=re.compile("Question[s]*-And-Answer[s]*", re.IGNORECASE))
# => 'Question-And-Answer'

t2.find(text=re.compile("Question[s]*-And-Answer[s]*", re.IGNORECASE))
# => None

In both cases the output should be the p Tag object.

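The problem is that the text you are looking for is split across several strong tags inside the p node. A regex passed through the text argument of .find is tested against each individual text node, not against the combined text of a tag, so it never sees the whole phrase. That is simply how the text argument is implemented in BS.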
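To make the splitting visible, here is a minimal sketch that uses only the standard library's html.parser rather than BeautifulSoup itself. TextNodeCollector and text_nodes are hypothetical helper names introduced for this illustration; the point is only that the phrase arrives as five separate text nodes, none of which matches on its own.

```python
import re
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect every text node separately, in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        # Called once for each run of text between two tags.
        self.nodes.append(data)


def text_nodes(markup):
    """Return the list of text nodes found in `markup`."""
    collector = TextNodeCollector()
    collector.feed(markup)
    collector.close()
    return collector.nodes


pattern = re.compile(r"Questions*-And-Answers*", re.IGNORECASE)
markup = ('<p class="p p8"><strong>Question</strong>-'
          '<strong>And</strong>-<strong>Answer</strong></p>')

nodes = text_nodes(markup)

# Testing node by node finds nothing, because no single node holds the phrase.
per_node_hits = [node for node in nodes if pattern.search(node)]

# Testing the concatenated text finds the phrase.
joined_hit = pattern.search("".join(nodes))

print(nodes)
print(per_node_hits)
print(joined_hit is not None)
```

Matching against the joined text is the same idea as matching against a tag's text property in BeautifulSoup.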

If you know that the text is always inside p nodes, you can pass a lambda to .find and run the regex search against the text property of each p tag, which contains the combined text of all its descendants:

print(t2.find(lambda t: t.name == "p" and re.search(r'Questions*-And-Answers*', t.text)))
# => <p class="p p8"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>

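Note that the character class [s] matches exactly the same thing as a plain s in a regex, so [s]* can be written as s*. If you need the case-insensitive matching from the question, pass re.IGNORECASE as the flags argument of re.search.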
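If the search is needed in more than one place, the lambda can be wrapped in a small helper. The following is a sketch, not part of BeautifulSoup: find_tag_by_text and its name parameter are hypothetical names chosen for this example, and it assumes the html.parser backend and a pattern compiled with re.IGNORECASE, as in the question.

```python
import re
from bs4 import BeautifulSoup

# Compiled once, with the case-insensitive flag used in the question.
PATTERN = re.compile(r"Questions*-And-Answers*", re.IGNORECASE)


def find_tag_by_text(soup, pattern, name="p"):
    """Return the first tag called `name` whose combined text matches `pattern`.

    Hypothetical helper: it tests the text of the tag and all of its
    descendants together, so nested formatting tags are ignored.
    """
    return soup.find(
        lambda tag: tag.name == name and pattern.search(tag.get_text()) is not None
    )


t1 = BeautifulSoup(
    '<p class="p p8"><strong>Question-And-Answer</strong></p>', "html.parser"
)
t2 = BeautifulSoup(
    '<p class="p p8"><strong>Question</strong>-<strong>And</strong>-'
    '<strong>Answer</strong></p>',
    "html.parser",
)

match1 = find_tag_by_text(t1, PATTERN)
match2 = find_tag_by_text(t2, PATTERN)

# Both lookups return the enclosing <p> tag, with or without nested <strong> tags.
print(match1.name, match2.name)
```

Restricting the search to one tag name keeps the result predictable. Without that restriction, every ancestor of the matching text would also match, because its combined text contains the phrase as well.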
