Python: Regular expression to extract text between any two tags in an HTML document
I tried using the regular expression "<.+>\\s*(.*?)\\s*<\\/?.+>" on an HTML file. The following is the Python code I used:
import re

def recursiveExtractor(content):
    # Group 1 is the whole match; group 2 is the text captured between two tags.
    re1 = r'(<.+>\s*(.+?)\s*<\/?.+>)'
    m = re.findall(re1, content)
    for _whole, text in m:
        if text:
            print(text, "\n")

f = """
<div class='a'>
<div class='b'>
<div class='c'>
<button>text1</button>
<div class='d'>text2</div>
</div>
</div>
</div>
"""
recursiveExtractor(f)
But it skips some text, because the HTML is nested and the regex resumes its search from the end of the previous match.
For the above input, the output is:
<div class='b'>
<div class='d'>text2</div>
</div>
But the expected output is:
text1
text2
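The skipping comes from how re.findall scans: matches never overlap, so a tag consumed as the closing delimiter of one match cannot also open the next match. A minimal sketch of that effect follows. The simplified pattern and the one-line input are assumptions chosen for illustration; they are not the pattern or the input from the question.

```python
import re

# Hypothetical one-line input, chosen only to show the effect.
html = "<a>x<b>y</b>z</a>"

# Tag, text, tag: the second tag is consumed by the match,
# so the next search starts after it.
consuming = re.findall(r'<[^>]+>([^<]*)<[^>]+>', html)

# Same idea, but the second tag is only peeked at with a lookahead,
# so it remains available as the opening tag of the next match.
peeking = re.findall(r'<[^>]+>([^<]*)(?=<[^>]+>)', html)

print(consuming)  # ['x', 'z']
print(peeking)    # ['x', 'y', 'z']
```

In the first call the text y is lost because its opening tag was already used up by the previous match.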
Edit: I read that HTML is not a regular language and hence can't be parsed with regular expressions. From what I understand, it is not possible to parse .* (i.e. with the same closing tags). But what I need is the text between any tags, for instance text1 text2 text3. So I am fine with a list of "text1", "text2", "text3".
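If a flat list of text nodes is all that is needed, a real parser avoids the regular-language problem entirely. The following is a minimal sketch using html.parser from the Python 3 standard library. The names TextCollector and extract_text are hypothetical, and this is a swapped-in technique, not the regex approach discussed in the question.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every non-blank text node, ignoring the tag structure."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # handle_data receives the text found between tags.
        stripped = data.strip()
        if stripped:
            self.texts.append(stripped)


def extract_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


sample = """
<div class='a'>
<div class='b'>
<button>text1</button>
<div class='d'>text2</div>
</div>
</div>
"""
print(extract_text(sample))  # ['text1', 'text2']
```

Nesting depth does not matter here, because the parser reports each text node as it encounters it.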
Why not just do this:
import re
f = """
<div class='a'>
<div class='b'>
<div class='c'>
<button>text1</button>
<div class='d'>text2</div>
</div>
</div>
</div>
"""
x = re.sub(r'<[^>]*>', '', f)  # you can also use re.sub(r'<[A-Za-z\/][^>]*>', '', f)
print('\n'.join(x.split()))
This will have the following output:
text1
text2
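One caveat with the snippet above: x.split() splits on every whitespace character, so a text node such as "two words" would be printed on two lines. Below is a minimal sketch of a variant, assuming one list entry per text node is wanted. The helper name text_between_tags is hypothetical; it reuses the same tag pattern with re.split.

```python
import re


def text_between_tags(html):
    # Split on anything that looks like a tag, then drop the pieces
    # that contain only whitespace.
    pieces = re.split(r'<[^>]*>', html)
    return [piece.strip() for piece in pieces if piece.strip()]


result = text_between_tags(
    "<div><button>text1</button><div class='d'>two words</div></div>"
)
print(result)  # ['text1', 'two words']
```

Like the original, this treats anything of the form <...> as a tag, so comments, script blocks, or a literal < inside text can still confuse it.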