Python: regular expression to extract the text between any two tags in HTML

Question

I tried using "<.+>\\s*(.*?)\\s*<\\/?.+>" on an HTML file. The following is the Python code I used:

import re

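def recursiveExtractor(content):
    # Raw string so the backslashes reach the regex engine unchanged
    re1 = r'(<.+>\s*(.+?)\s*<\/?.+>)'
    # findall returns one (group 1, group 2) tuple per non-overlapping match
    m = re.findall(re1, content)
    for _, text in m:
        if text:
            print(text, "\n")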

f = """
<div class='a'>
      <div class='b'>
        <div class='c'>
            <button>text1</button>
            <div class='d'>text2</div>
        </div>
      </div>
    </div>
"""
recursiveExtractor(f)

But it skips some text, because the HTML is nested and the regex resumes searching from the end of the previously matched part.

For the above input, the output is:

<div class='b'>

<div class='d'>text2</div>

</div>

But the expected output is:

text1

text2

Edit: I read that HTML is not a regular language and hence can't be parsed with regular expressions. From what I understand, it is not possible to parse .* (i.e. with the same closing tags). But what I need is the text between any tags, for instance text1 text2 text3, so I am fine with a list of "text1", "text2", "text3".

Answer 1

Why not just do this:

import re

f = """
<div class='a'>
      <div class='b'>
        <div class='c'>
            <button>text1</button>
            <div class='d'>text2</div>
        </div>
      </div>
    </div>
"""
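# Delete everything that looks like a tag, leaving only the text and whitespace
x = re.sub(r'<[^>]*>', '', f)  # you can also use re.sub(r'<[A-Za-z\/][^>]*>', '', f)

# split() breaks on any whitespace, so each remaining word lands on its own line
print('\n'.join(x.split()))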

This will have the following output:

text1
text2
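
Since the question mentions that a list such as "text1", "text2" would be fine, here is a minimal sketch of the same tag-stripping idea that returns a list instead of printing. The function name extract_texts is just a placeholder of mine, and the sketch assumes simple markup: no comments, no script or style blocks, and no ">" inside attribute values. It replaces each tag with a newline rather than an empty string, so text with internal spaces (for example "hello world") stays in one piece, which x.split() would break apart.

```python
import re

html = """
<div class='a'>
      <div class='b'>
        <div class='c'>
            <button>text1</button>
            <div class='d'>text2</div>
        </div>
      </div>
    </div>
"""

def extract_texts(markup):
    """Return the non-empty text fragments found between tags.

    Hypothetical helper for illustration only; assumes simple markup
    without comments, scripts, or '>' inside attribute values.
    """
    # Replace every tag-like span with a newline so neighbouring
    # text fragments are not glued together.
    stripped = re.sub(r'<[^>]*>', '\n', markup)
    # Keep each non-blank line, trimmed of surrounding whitespace.
    return [line.strip() for line in stripped.splitlines() if line.strip()]

texts = extract_texts(html)
print(texts)
```

HTML entities such as &amp;amp; are left undecoded here, and for anything beyond small, well-behaved snippets a real HTML parser is likely the safer choice.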
