I tried using "<.+>\\s*(.*?)\\s*<\\/?.+>"
on an HTML file. The following is the Python code I used:
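import re

def recursiveExtractor(content):
    # Capture whatever sits between two tag-like spans (group 2 of each match)
    re1 = r'(<.+>\s*(.+?)\s*<\/?.+>)'
    m = re.findall(re1, content)
    for match in m:
        text = match[1]  # inner group: the text between the tags
        if text:
            print(text, "\n")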
f = """
<div class='a'>
<div class='b'>
<div class='c'>
<button>text1</button>
<div class='d'>text2</div>
</div>
</div>
</div>
"""
recursiveExtractor(f)
However, it skips some text, because the HTML is nested and the regex resumes its search from the end of the previous match.
For the above input, the output is:
<div class='b'>
<div class='d'>text2</div>
</div>
But the expected output is:
text1
text2
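The skipping described earlier can be reproduced on a much smaller input. The following is only a sketch: the one-line `sample` string and both patterns are made up for illustration and are not the regex from the code above. `re.findall` returns non-overlapping matches, so a tag consumed as the closing side of one match is no longer available as the opening side of the next one. Turning the trailing tag into a lookahead leaves it unconsumed.

```python
import re

# Hypothetical input, chosen so that <b> is both "closing" for x and "opening" for y.
sample = "<a>x<b>y</b>z</a>"

# Consuming pattern: the trailing tag is part of the match, so the next
# search starts after it and the text "y" is never seen.
consuming = re.findall(r"<[^>]+>([^<]*)<[^>]+>", sample)

# Lookahead pattern: the trailing tag is only checked, not consumed,
# so it can serve as the opening tag of the next match.
lookahead = re.findall(r"<[^>]+>([^<]*)(?=<[^>]+>)", sample)

print(consuming)  # ['x', 'z']
print(lookahead)  # ['x', 'y', 'z']
```

This only illustrates why text goes missing; it does not make a regex a reliable way to handle arbitrary HTML.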
Edit: I read that HTML is not a regular language and hence can't be parsed with regular expressions. From what I understand, it is not possible to parse .* (i.e. with the same closing tags). But what I need is the text between any tags, for instance text1 text2 text3. So I am fine with a list of "text1", "text2", "text3".
Why not just do this:
import re
f = """
<div class='a'>
<div class='b'>
<div class='c'>
<button>text1</button>
<div class='d'>text2</div>
</div>
</div>
</div>
"""
x = re.sub(r'<[^>]*>', '', f)  # strip every tag; you can also use re.sub(r'<[A-Za-z\/][^>]*>', '', f)
print('\n'.join(x.split()))  # split the remaining text on whitespace, one token per line
This will have the following output:
text1
text2
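One thing to keep in mind with the approach above is that `x.split()` breaks on any whitespace, so a text node such as "some text" would come out as two separate lines. If that matters, an alternative is to let a real parser find the text nodes. The following is a minimal sketch using `html.parser.HTMLParser` from the standard library (Python 3); the class name `TextCollector` and the variable `html` are made up for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the non-blank text found between tags."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # Called for the raw text between tags, including whitespace-only runs,
        # so blank pieces are dropped after stripping.
        stripped = data.strip()
        if stripped:
            self.texts.append(stripped)


html = """
<div class='a'>
<div class='b'>
<div class='c'>
<button>text1</button>
<div class='d'>text2</div>
</div>
</div>
</div>
"""

collector = TextCollector()
collector.feed(html)
collector.close()
print(collector.texts)  # ['text1', 'text2']
```

This gives the list of strings asked for in the question, and nesting depth does not matter because the parser reports each text node as it encounters it.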