I want to identify a split point in a large text document with BeautifulSoup, so I wrote a regular expression to find the Tag in which a specific string occurs. The problem is that it does not work if the string I am searching for contains further formatting, i.e. child nodes.
import re
from bs4 import BeautifulSoup

t1 = BeautifulSoup("<p class=\"p p8\"><strong>Question-And-Answer</strong></p>", "html.parser")
t2 = BeautifulSoup("<p class=\"p p8\"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>", "html.parser")
t1.find(text = re.compile("Question[s]*-And-Answer[s]*", re.IGNORECASE))
# => 'Question-And-Answer'
t2.find(text = re.compile("Question[s]*-And-Answer[s]*", re.IGNORECASE))
# => None
The output should be the p Tag object.
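The problem here is that the text you are looking for is split across several strong tags inside the p node. The text argument of .find is matched against each individual string in the tree (here "Question", "-", "And", "-", "Answer"), never against the combined text of a tag, so the regex search cannot succeed. That is simply how it is implemented in BS.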
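To make the cause visible, here is a minimal sketch that lists the separate strings inside the p node and tests the pattern against each one and against the joined text. The variable names (html, pieces, joined) are only for illustration, and the "html.parser" argument is an assumption about which parser is used.

```python
import re
from bs4 import BeautifulSoup

html = ('<p class="p p8"><strong>Question</strong>-'
        '<strong>And</strong>-<strong>Answer</strong></p>')
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"Questions*-And-Answers*", re.IGNORECASE)

# The p node holds five separate strings, one per text fragment.
pieces = list(soup.p.strings)

# No single fragment contains the whole phrase, so none of them matches.
matches_per_piece = [bool(pattern.search(s)) for s in pieces]

# The concatenated text of the p node does contain the phrase.
joined = soup.p.get_text()
joined_matches = bool(pattern.search(joined))

print(pieces)
print(matches_per_piece, joined_matches)
```

Searching the joined text rather than the individual fragments is what makes a match possible.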
If you know that the text is in p nodes, you can pass a lambda expression to the .find call and run a regex search against the text property of each p tag to find the elements you need:
print(t2.find(lambda t: t.name == "p" and re.search(r'Questions*-And-Answers*', t.text)))
# => <p class="p p8"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>
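If the same lookup is needed in several places, the lambda can be wrapped in a small helper. This is only a sketch: find_tags_by_text is a hypothetical name, not part of BeautifulSoup, and it assumes the phrase always sits inside a single tag of a known name. It also restores the re.IGNORECASE flag from the question.

```python
import re
from bs4 import BeautifulSoup

def find_tags_by_text(soup, pattern, name="p"):
    """Return every tag called `name` whose combined text matches `pattern`."""
    return soup.find_all(
        lambda tag: tag.name == name
        and pattern.search(tag.get_text()) is not None
    )

pattern = re.compile(r"Questions*-And-Answers*", re.IGNORECASE)

t1 = BeautifulSoup(
    '<p class="p p8"><strong>Question-And-Answer</strong></p>',
    "html.parser",
)
t2 = BeautifulSoup(
    '<p class="p p8"><strong>Question</strong>-<strong>And</strong>-'
    '<strong>Answer</strong></p>',
    "html.parser",
)

# Both documents yield the enclosing p tag, with or without nested markup.
hits1 = find_tags_by_text(t1, pattern)
hits2 = find_tags_by_text(t2, pattern)
```

Using find_all instead of find returns every candidate, which may help if the phrase can occur more than once in a large document.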
Note that [s] is the same as s in a regex.
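A quick check of that equivalence, using only the standard re module. The sample strings are made up for illustration.

```python
import re

samples = [
    "Question-And-Answer",
    "Questions-and-answers",
    "QUESTION-AND-ANSWERS",
    "Question And Answer",  # no hyphens, so neither pattern should match
]

bracketed = re.compile(r"Question[s]*-And-Answer[s]*", re.IGNORECASE)
plain = re.compile(r"Questions*-And-Answers*", re.IGNORECASE)

# For each sample, record whether each spelling of the pattern matches.
results = [
    (bool(bracketed.search(s)), bool(plain.search(s))) for s in samples
]

for sample, (with_brackets, without_brackets) in zip(samples, results):
    print(sample, with_brackets, without_brackets)
```

A character class with a single member matches exactly that character, so both patterns agree on every input.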