Overwrite every HTML tag with spaces so that the position of each text character does not move

Question

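In Python, I want to overwrite every HTML tag with spaces so that the position of each text character in the string does not change.

For example, <p> would be replaced by three spaces.

Below is my best attempt at code that achieves this, but it feels too fragile and complicated for such a simple task: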

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Accumulates the input with every tag replaced by spaces.
        self.html_redacted_with_whitespaces = ""

    def handle_starttag(self, tag, attrs):
        # "<" + tag + ">": assumes the start tag has no attributes.
        self.html_redacted_with_whitespaces += " " * (len(tag) + 2)

    def handle_endtag(self, tag):
        # "</" + tag + ">"
        self.html_redacted_with_whitespaces += " " * (len(tag) + 3)

    def handle_data(self, data):
        # Text between tags is copied through unchanged.
        self.html_redacted_with_whitespaces += data
        
parser = MyHTMLParser()
test_html = """<html><head><title>Test</title></head>
<body><h1>Replace my tags with spaces!</h1></body></html>"""
parser.feed(test_html)

len(test_html), len(parser.html_redacted_with_whitespaces)
print(test_html)
print(parser.html_redacted_with_whitespaces)

Output:

<html><head><title>Test</title></head>
<body><h1>Replace my tags with spaces!</h1></body></html>
                   Test               
          Replace my tags with spaces!

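My goal is to redact the HTML with whitespace before feeding it into spaCy.
Redacting the HTML first is necessary because HTML tags do "confuse" the NLP model.
This problem is partly discussed in: https://github.com/explosion/spaCy/issues/4177

The reason I want to keep the spacing unchanged is to be able to use spaCy's NER doc.ents for highlighting. Later, in post-processing, I inject my own tags into the original HTML.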
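To illustrate why unchanged offsets matter, here is a minimal sketch of that post-processing step. It assumes the entity offsets arrive as (start, end) character pairs, for instance built from spaCy's ent.start_char and ent.end_char, and that spans do not overlap or cross a tag boundary. The function name inject_marks and the choice of a <mark> element are hypothetical, chosen only for this example.

```python
def inject_marks(original_html, spans, tag="mark"):
    """Wrap each (start, end) character span of original_html in <tag>...</tag>.

    Assumes spans do not overlap and do not cross an HTML tag boundary.
    """
    out = original_html
    # Work from the last span to the first so earlier offsets stay valid
    # while text is being inserted.
    for start, end in sorted(spans, reverse=True):
        out = out[:start] + f"<{tag}>" + out[start:end] + f"</{tag}>" + out[end:]
    return out


html = "<p>Hello world</p>"
# Offsets measured on the space-redacted text also index the original HTML.
print(inject_marks(html, [(9, 14)]))
```

Because the redacted string has the same length as the original, the offsets need no translation before they are applied.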

I searched around but could not find a turnkey solution.

Answer 1

You can use a regex-based substitution with a callback, which lets you know what text is being replaced:

import re

test_html = """<html><head><title>Test</title></head>
<body><h1>Replace my tags with spaces!</h1></body></html>"""


filtered_text = re.sub(r"<.*?>", lambda m: " " * len(m.group(0)), test_html)

The callback is passed a regex match object for each match. Its group method returns the matched string (group(0) is the whole match), and we take its len.
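As a small check of the idea, the sketch below wraps the same substitution in a helper (the name blank_tags is just for illustration) and confirms that an index found in the blanked string points at the same text in the original.

```python
import re


def blank_tags(html):
    """Replace every <...> run with the same number of spaces."""
    return re.sub(r"<.*?>", lambda m: " " * len(m.group(0)), html)


sample = "<p>Hello <b>world</b></p>"
blanked = blank_tags(sample)

# The length is preserved, so any index into `blanked` is valid for `sample`.
assert len(blanked) == len(sample)
start = blanked.index("world")
assert sample[start:start + len("world")] == "world"
```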

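The regex itself is trivial and somewhat naive: it just looks for any opening < and the next closing >. It does not handle any escaping of those characters (as inside <script>, <cdata>, or comment tags) at all.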
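One possible way to soften some of those limitations, offered as a sketch rather than a complete HTML-aware solution, is to match comments and whole script/style elements before falling back to generic tags. The pattern and the name blank_markup are assumptions made for this example. Blanking the bodies of script and style elements is also an assumption (it suits NLP input but may not suit other uses), and newlines inside the blanked markup are kept so line numbers stay stable. Tags whose attribute values contain a literal > are still mishandled.

```python
import re

# Alternatives are tried left to right, so the specific forms come first.
_MARKUP = re.compile(
    r"<!--.*?-->"                 # comments, which may contain '>'
    r"|<script\b.*?</script\s*>"  # script elements, including their body
    r"|<style\b.*?</style\s*>"    # style elements, including their body
    r"|<[^>]*>",                  # any other tag, possibly spanning lines
    re.DOTALL | re.IGNORECASE,
)


def blank_markup(html):
    """Replace markup with spaces, keeping newlines and total length."""
    return _MARKUP.sub(lambda m: re.sub(r"[^\n]", " ", m.group(0)), html)


src = "<!-- a > b --><p>x</p><script>if (a > b) {}</script>"
out = blank_markup(src)
assert len(out) == len(src)
assert out.strip() == "x"
```

For input that is not well formed, a real HTML parser remains the safer choice.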
