How do I exclude the content from HTML pages and keep only the HTML tags?

Question

I have a huge corpus of HTML pages, and I want to strip all the content from this dataset and extract only the HTML tags (I want the tags, not the contents). For instance, if I have this HTML element:

<div class="tensorsite-content__title  ">
      Differentiate yourself with the TensorFlow Developer Certificate    </div>

I need to extract only:

 <div class="tensorsite-content__title  ">
           </div>

I have tried a (?!) negative lookahead regex to remove everything except the HTML tag matches, with:

tags=re.sub('.*?!<[^<]+?>', '',htmlwithcontent )

But besides not looking smart or efficient, it doesn't even work!

So do you have any ideas? Preferably in Python.

Answer 1

As Ivar commented, an HTML parser is really the only way to correctly deal with this class of problem:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.indent = -1

    def handle_starttag(self, tag, attrs):
        self.indent += 1
        print(2 * self.indent * ' ', sep='', end='')
        print(f'<{tag}', sep='', end='')
        for attr in attrs:
            print(f' {attr[0]}="{attr[1]}"', sep='', end='')
        print('>', sep='')

    def handle_endtag(self, tag):
        print(2 * self.indent * ' ', sep='', end='')
        print(f'</{tag}>')
        self.indent -= 1

parser = MyHTMLParser()
parser.feed("""<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <h1>Heading!</h1>
    <p style="font-weight: bold; color: red;">
       Some text
       <BR/>
       Some more text
    </p>
    <ol>
       <li>Item 1</li>
       <li>Item 2</li>
     </ol>
  </body>
</html>
""")

Prints:

<html>
  <head>
    <title>
    </title>
  </head>
  <body>
    <h1>
    </h1>
    <p style="font-weight: bold; color: red;">
      <br>
      </br>
    </p>
    <ol>
      <li>
      </li>
      <li>
      </li>
    </ol>
  </body>
</html>

See Python Demo
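
If you need the tags as data rather than printed output, the same idea can be adapted to collect them in a list. The following is only a sketch under a few assumptions: the names `TagCollector` and `extract_tags` are made up for illustration, self-closing tags such as `<br/>` are emitted as a single tag, valueless attributes (which `HTMLParser` reports with a value of `None`) are written without `="..."`, and attribute values are re-escaped with `html.escape` because the parser hands them over unescaped.

```python
from html import escape
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: keeps tag markup, ignores text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    @staticmethod
    def _format_attrs(attrs):
        # Valueless attributes (e.g. <input disabled>) arrive as (name, None).
        return ''.join(
            f' {name}' if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        self.tags.append(f'<{tag}{self._format_attrs(attrs)}>')

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing syntax such as <br/>.
        self.tags.append(f'<{tag}{self._format_attrs(attrs)}/>')

    def handle_endtag(self, tag):
        self.tags.append(f'</{tag}>')


def extract_tags(html):
    """Return the tags found in html, in document order."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()  # flush anything still buffered
    return parser.tags


sample = ('<div class="tensorsite-content__title  ">\n'
          '  Differentiate yourself with the TensorFlow Developer Certificate'
          '  </div>')
print('\n'.join(extract_tags(sample)))
```

Note that `HTMLParser` lowercases tag and attribute names, so the collected tags are normalized rather than byte-for-byte copies of the input.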

Update

If the HTML is in a not-too-large file, it makes sense to read the entire file into memory and pass it to the parser like this:

parser = MyHTMLParser()
with open('test.html') as f:
    html = f.read()
    parser.feed(html)

If the input is in an extremely large file, it might make sense to "feed" the parser line by line or in chunks rather than attempting to read the entire file into memory:

Line by line:

parser = MyHTMLParser()
with open('test.html') as f:
    for line in f:
        parser.feed(line)

Or, even more efficiently, read in chunks of 32K:

CHUNK_SIZE = 32 * 1024
parser = MyHTMLParser()
with open('test.html') as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if chunk == '':
            break
        parser.feed(chunk)

You can, of course, choose even larger chunk sizes.
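
The chunked loop can also be written a little more compactly with the two-argument form of `iter()`, which keeps calling `read()` until it returns the sentinel `''`. This is just a sketch: `feed_in_chunks` and `TagNameCollector` are hypothetical names, and an in-memory `io.StringIO` with a tiny chunk size stands in for a large file so that the example runs on its own. Because the stream is opened in text mode, the chunk size counts characters, not bytes.

```python
import io
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Hypothetical minimal parser that records start-tag names."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def feed_in_chunks(parser, stream, chunk_size=32 * 1024):
    """Feed a text stream to the parser chunk by chunk."""
    # iter(callable, sentinel) stops when read() returns '' (end of input).
    for chunk in iter(lambda: stream.read(chunk_size), ''):
        parser.feed(chunk)
    parser.close()  # process any data still buffered in the parser


parser = TagNameCollector()
# A chunk size of 7 deliberately splits tags across chunk boundaries.
stream = io.StringIO('<ul><li>Item 1</li><li>Item 2</li></ul>')
feed_in_chunks(parser, stream, chunk_size=7)
print(parser.names)
```

A tag that is split across two chunks is not lost: the parser buffers the incomplete part until the next `feed()` call completes it.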


Solution 1
1 (accepted) 2020-03-16 12:10:16
