
How to exclude content from an HTML page and keep only the HTML tags?

I have a huge corpus of HTML pages, and I want to strip all the content from this dataset so that only the HTML tags remain (I want the tags, not the contents). For instance, if I have this HTML element:

<div class="tensorsite-content__title  ">
      Differentiate yourself with the TensorFlow Developer Certificate    </div>

I need to extract only:

 <div class="tensorsite-content__title  ">
           </div>

I have tried using a negative-lookahead (?!) regex to exclude the HTML tags from the match:

tags=re.sub('.*?!<[^<]+?>', '',htmlwithcontent )

Apart from not looking smart or efficient, it simply doesn't work!

So do you have any ideas? Preferably in Python.

As Ivar commented, an HTML parser is really the only way to deal correctly with this class of problem:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    """Prints every start and end tag, indented by nesting depth; text is dropped."""

    def __init__(self):
        super().__init__()
        self.indent = -1  # becomes 0 at the first start tag

    def handle_starttag(self, tag, attrs):
        self.indent += 1
        print(2 * self.indent * ' ', end='')
        print(f'<{tag}', end='')
        for name, value in attrs:
            # A valueless attribute (e.g. disabled) is reported with value None.
            print(f' {name}' if value is None else f' {name}="{value}"', end='')
        print('>')

    def handle_endtag(self, tag):
        print(2 * self.indent * ' ', end='')
        print(f'</{tag}>')
        self.indent -= 1
parser = MyHTMLParser()
parser.feed("""<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <h1>Heading!</h1>
    <p style="font-weight: bold; color: red;">
       Some text
       <BR/>
       Some more text
    </p>
    <ol>
       <li>Item 1</li>
       <li>Item 2</li>
     </ol>
  </body>
</html>
""")

Prints:

<html>
  <head>
    <title>
    </title>
  </head>
  <body>
    <h1>
    </h1>
    <p style="font-weight: bold; color: red;">
      <br>
      </br>
    </p>
    <ol>
      <li>
      </li>
      <li>
      </li>
    </ol>
  </body>
</html>

See Python Demo
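
If the tags need to be stored rather than printed, the same idea can be sketched as a parser that collects them in a list. This is only a sketch under stated assumptions: the names TagCollector, strip_content and VOID_ELEMENTS are hypothetical and not part of the answer above. It also assumes you want self-closing tags such as <BR/> emitted once (without the extra </br> shown in the output above) and attribute values re-escaped on output.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (the "void" elements).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}

class TagCollector(HTMLParser):
    """Hypothetical variant: collects tags in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            # Valueless attributes such as `disabled` arrive with value None.
            if value is None:
                parts.append(name)
            else:
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.tags.append("<" + " ".join(parts) + ">")

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing syntax such as <br/>; emit a single tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Ignore stray closers for void elements such as </br>.
        if tag not in VOID_ELEMENTS:
            self.tags.append(f"</{tag}>")

def strip_content(html):
    """Return the tags of `html`, one per line, with all text removed."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.tags)

if __name__ == "__main__":
    print(strip_content('<div class="title">Some text<br/>more text</div>'))
```

Note that this drops the indentation of the original answer; comments, doctype declarations and the text inside script and style elements are ignored as well, because no handlers are defined for them.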

Update

If the HTML is in a file that is not too large, it makes sense to read the entire file into memory and pass it to the parser thus:

parser = MyHTMLParser()
with open('test.html') as f:
    html = f.read()
    parser.feed(html)

If the input is in an extremely large file, it might make sense to "feed" the parser line by line or in chunks rather than attempting to read the entire file into memory:

Line by line:

parser = MyHTMLParser()
with open('test.html') as f:
    for line in f:
        parser.feed(line)

Or, likely more efficiently, read in chunks of 32K:

CHUNK_SIZE = 32 * 1024
parser = MyHTMLParser()
with open('test.html') as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if chunk == '':
            break
        parser.feed(chunk)

You can, of course, choose even larger chunk sizes.
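
For a corpus of many files, the chunked loop can be wrapped in a small helper. The following is a minimal sketch, not part of the original answer: the function name feed_file_in_chunks and the utf-8 default with errors="replace" are assumptions, so adjust them to your data. It also calls parser.close() once the file is exhausted, so that the parser processes anything it was still buffering.

```python
from functools import partial
from html.parser import HTMLParser

def feed_file_in_chunks(parser, path, chunk_size=32 * 1024, encoding="utf-8"):
    """Feed the file at `path` to `parser` in chunks of `chunk_size` characters."""
    # errors="replace" is an assumption: it keeps badly encoded pages from
    # aborting a long run over a corpus.
    with open(path, encoding=encoding, errors="replace") as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size) until it
        # returns the empty string, i.e. until end of file.
        for chunk in iter(partial(f.read, chunk_size), ""):
            parser.feed(chunk)
    # Force processing of any input the parser is still holding back.
    parser.close()
```

Because HTMLParser buffers incomplete input between calls to feed(), a tag that happens to be split across two chunks should still be reported as a single tag. Use a fresh parser instance (or call reset()) for each file.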
