
How to remove [font] tags using html2text/BeautifulSoup in Python

I'm using BeautifulSoup to fetch content from my website. The result is a chunk of markup with a lot of tags:

<span style="color: blue;"><span style="color: blue;">[font='Times New Roman']<span style="font-size: 22pt;">THIS</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> IS </span>[/font]<span style="color: #FF3300;"><span style="color: #FF3300;">[font='Times New Roman']<span style="font-size: 22pt;">A TEST</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> USING </span>[/font]<span style="color: #00CC66;"><span style="color: #00CC66;">[font='Times New Roman']<span style="font-size: 22pt;">SOME</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> BEAUTIFUL </span>[/font]<span style="color: fuchsia;"><span style="color: fuchsia;">[font='Times New Roman']<span style="font-size: 22pt;">SOUP</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> | </span>[/font]<span style="color: #00CCFF;"><span style="color: #00CCFF;">[font='Times New Roman']<span style="font-size: 22pt;">96786</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> AND </span>[/font]<span style="color: #CC33FF;"><span style="color: #CC33FF;">[font='Times New Roman']<span style="font-size: 22pt;">HTML2TEXT</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> TO LEARN </span>[/font]<span style="color: red;"><span style="color: red;">[font='Times New Roman']<span style="font-size: 22pt;">NEW THING</span>[/font]</span></span>

Then I use html2text to extract the raw text from that chunk:

import html2text

h = html2text.HTML2Text()
h.ignore_links = True
h.ignore_images = True
h.ignore_emphasis = True
print(h.handle(content))  # content is the chunk of markup shown above

The best result I have so far is:

[font='Times New Roman']THIS[/font][font='Times New Roman'] THIS
[/font][font='Times New Roman']IS[/font][font='Times New
Roman'] A TEST [/font][font='Times New Roman']USING[/font][font='Times New
Roman'] BEAUTIFUL [/font][font='Times New Roman'] SOUP [/font][font='Times New Roman']
| [/font][font='Times New Roman']96786[/font][font='Times New Roman'] AND [/font][font='Times New Roman'] HTML2TEXT [/font][font='Times New Roman'] TO LEARN [/font][font='Times New Roman']NEW THING[/font]

How do I get rid of the [font] tags using html2text + BeautifulSoup, or any other way? Thank you.

My current approach uses string replacement to replace [font ...] and [/font] with '', but that seems inefficient. Is there another way to solve this?

It looks like your input is a mix of HTML and BBCode. BeautifulSoup and html2text are both meant to parse and extract text from HTML, not BBCode, so they treat the square-bracket [font] markers as ordinary text.
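To make that concrete, here is a minimal sketch using only the standard library's html.parser (not BeautifulSoup or html2text themselves). TextCollector, visible_text and the shortened sample string are hypothetical names introduced for illustration. It shows that an HTML parser hands the BBCode markers back as plain text, which is why they survive in the output.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records every text node the HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between HTML tags.
        self.chunks.append(data)


def visible_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Shortened version of the input from the question.
sample = (
    "<span style=\"color: blue;\">[font='Times New Roman']"
    "<span>THIS</span>[/font]</span>"
)

# The <span> tags are consumed as markup; the [font] markers come back as text.
print(visible_text(sample))  # [font='Times New Roman']THIS[/font]
```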

One simple solution is to convert the [font] BBCode "tags" into HTML before processing with either BeautifulSoup or html2text. You could use regular expressions to do the conversion; see convert_bbcode_fonts below. (Note that this doesn't actually convert your input into "valid" HTML4 font tags: the opening tag becomes <font ='Times New Roman'>. html2text still handles the input, though.)

import re
import html2text


def convert_bbcode_fonts(html):
    flags = re.IGNORECASE | re.MULTILINE
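    # replace start font tags; the replacement must be a raw string so that
    # \1 is a backreference to the captured attributes, not the character \x01
    html = re.sub(r'\[font\s*([^\]]+)\]', r'<font \1>', html, flags=flags)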
    # replace end font tags
    html = re.sub(r'\[/font\s*\]', '</font>', html, flags=flags)
    return html

def extract_text(html):
    html = convert_bbcode_fonts(html)
    h = html2text.HTML2Text()
    h.ignore_links = True
    h.ignore_images = True
    h.ignore_emphasis = True
    return h.handle(html)

INPUT = """
<span style="color: blue;"><span style="color: blue;">[font='Times New Roman']<span style="font-size: 22pt;">THIS</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> IS </span>[/font]<span style="color: #FF3300;"><span style="color: #FF3300;">[font='Times New Roman']<span style="font-size: 22pt;">A TEST</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> USING </span>[/font]<span style="color: #00CC66;"><span style="color: #00CC66;">[font='Times New Roman']<span style="font-size: 22pt;">SOME</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> BEAUTIFUL </span>[/font]<span style="color: fuchsia;"><span style="color: fuchsia;">[font='Times New Roman']<span style="font-size: 22pt;">SOUP</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> | </span>[/font]<span style="color: #00CCFF;"><span style="color: #00CCFF;">[font='Times New Roman']<span style="font-size: 22pt;">96786</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> AND </span>[/font]<span style="color: #CC33FF;"><span style="color: #CC33FF;">[font='Times New Roman']<span style="font-size: 22pt;">HTML2TEXT</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> TO LEARN </span>[/font]<span style="color: red;"><span style="color: red;">[font='Times New Roman']<span style="font-size: 22pt;">NEW THING</span>[/font]</span></span>
"""

if __name__ == '__main__':
    print(extract_text(INPUT))
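If the font information is not needed at all, a variant is to delete the BBCode markers instead of converting them. The sketch below is an alternative rather than part of the approach above: it assumes [font ...] and [/font] are the only BBCode markers in the input, and strip_bbcode_fonts and BBCODE_FONT are hypothetical names. A single compiled pattern handles both the opening and the closing marker, which is also tidier than chaining several str.replace calls.

```python
import re

# Matches an opening [font ...] or a closing [/font] marker, in any letter case.
# \b keeps the pattern from matching longer names such as [fontsize=3].
BBCODE_FONT = re.compile(r"\[/?font\b[^\]]*\]", re.IGNORECASE)


def strip_bbcode_fonts(markup):
    """Remove BBCode font markers and leave the surrounding HTML untouched."""
    return BBCODE_FONT.sub("", markup)


# Shortened version of the input from the question.
sample = (
    "[font='Times New Roman']"
    "<span style=\"font-size: 22pt;\"> IS </span>[/FONT]"
)

print(strip_bbcode_fonts(sample))  # <span style="font-size: 22pt;"> IS </span>
```

The cleaned string is ordinary HTML, so it can be passed to h.handle() in place of the converted markup.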
