Remove all styles, scripts, and HTML tags from an HTML page

Question

This is what I have so far:

from bs4 import BeautifulSoup

def cleanme(html):
    soup = BeautifulSoup(html) # create a new bs4 object from the html data loaded
    for script in soup(["script"]): 
        script.extract()
    text = soup.get_text()
    return text
testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"

cleaned = cleanme(testhtml)
print (cleaned)

This works for removing the script.

Answer 1

It looks like you almost have it. You also need to remove the HTML tags and the CSS style code. Here is my solution (I updated the function):

def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser") # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]): # remove all javascript and stylesheet code
        script.extract()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
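
The whitespace cleanup at the end of that function does not depend on BeautifulSoup, so it can be tried on a plain string. The sketch below isolates those three steps; the helper name normalize_whitespace and the sample string raw are made up for illustration and are not part of the answer.

```python
def normalize_whitespace(text):
    # Hypothetical helper: the same three steps as the tail of cleanMe above.
    # 1. Split into lines and strip leading/trailing space on each line.
    lines = (line.strip() for line in text.splitlines())
    # 2. Split each line again wherever two consecutive spaces occur.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # 3. Drop empty chunks and join the rest, one chunk per line.
    return '\n'.join(chunk for chunk in chunks if chunk)

# Assumed sample input, similar to what get_text() can return.
raw = "\n\nTHIS IS AN EXAMPLE \n  I need this text captured  And this\n\n"
print(normalize_whitespace(raw))
# THIS IS AN EXAMPLE
# I need this text captured
# And this
```

Note that only runs of two spaces split a line; single spaces inside a phrase are kept.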

Answer 2

You can use decompose to remove the tags from the document completely, and the stripped_strings generator to retrieve the contents of the remaining tags.

def clean_me(html):
    soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid the bs4 warning
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

>>> clean_me(testhtml) 
'THIS IS AN EXAMPLE I need this text captured And this'
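
Where BeautifulSoup is not installed, roughly the same idea (skip script and style, collect the stripped strings) can be sketched with the standard library's html.parser. This swaps in a different technique from the answer above, and the names TextExtractor and clean_me_stdlib are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text outside <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.strings = []      # stripped, non-empty text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if stripped and not self._skip_depth:
            self.strings.append(stripped)


def clean_me_stdlib(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.strings)


testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"
print(clean_me_stdlib(testhtml))
# THIS IS AN EXAMPLE I need this text captured And this
```

Unlike BeautifulSoup, this sketch does not repair broken markup, so it is best treated as a fallback for reasonably well-formed pages.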

Answer 3

This removes the specified tags and the comments in a clean way. Thanks to Kim Hyesung for this code.

from bs4 import BeautifulSoup
from bs4 import Comment

def cleanMe(html):
    soup = BeautifulSoup(html, "html5lib")  # requires: pip install html5lib
    # remove the unwanted tags together with their contents
    for tag in soup.find_all(['script', 'style', 'meta', 'noscript']):
        tag.extract()
    # remove HTML comments
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # returns the cleaned soup object; call soup.get_text() if you need plain text
    return soup

Answer 4

Use lxml instead:

# Requirements: pip install lxml
# (from lxml 5.2 on, lxml.html.clean lives in a separate package: pip install lxml_html_clean)

import lxml.html.clean


def cleanme(content):
    cleaner = lxml.html.clean.Cleaner(
        allow_tags=[''],
        remove_unknown_tags=False,
        style=True,
    )
    html = lxml.html.document_fromstring(content)
    html_clean = cleaner.clean_html(html)
    return html_clean.text_content().strip()

testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"
cleaned = cleanme(testhtml)
print (cleaned)

Answer 5

If you want a quick and dirty solution, you can use:

re.sub(r'<[^>]*?>', '', value)

This gives you a rough equivalent of PHP's strip_tags. Is that what you want?
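
One caveat: that regular expression removes only the tags themselves, so the contents of script and style elements stay in the output. A minimal sketch of a variant that drops those blocks first is shown below. The helper name strip_tags_re is hypothetical, and replacing each tag with a space (rather than an empty string) is a small change made here so that neighbouring words do not run together.

```python
import re

testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"

# The one-liner on its own keeps the script and style contents.
naive = re.sub(r'<[^>]*?>', '', testhtml)
print('getit' in naive)
# True


def strip_tags_re(value):
    # Hypothetical helper, not part of the original answer.
    # Drop <script>...</script> and <style>...</style> blocks, contents included.
    value = re.sub(r'(?is)<(script|style)\b[^>]*>.*?</\1\s*>', '', value)
    # Replace every remaining tag (including the doctype) with a space.
    value = re.sub(r'<[^>]*?>', ' ', value)
    # Collapse the whitespace left behind by the removed markup.
    return ' '.join(value.split())


print(strip_tags_re(testhtml))
# THIS IS AN EXAMPLE I need this text captured And this
```

Regular expressions cannot handle every HTML edge case (for example a ">" inside an attribute value), so this remains a quick and dirty approach.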

Answer 6

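Another implementation, in addition to styvane's answer. If you need to extract a lot of text, take a look at selectolax, which is much faster than lxml.

Code and example in the online IDE: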

from bs4 import BeautifulSoup

def clean_me(html):
    soup = BeautifulSoup(html, 'lxml')  # requires: pip install lxml

    body = soup.body
    if body is None:
        return None

    # remove everything besides text
    for tag in body.select('script'):
        tag.decompose()
    for tag in body.select('style'):
        tag.decompose()

    plain_text = body.get_text(separator='\n').strip()
    return plain_text

# testhtml is the sample string from the question
print(clean_me(testhtml))

6 answers

Answer 1: score 24, accepted, 2015-06-01 03:55:18
Answer 2: score 14, 2015-06-01 04:21:25
Answer 3: score 6, 2018-03-23 00:39:50
Answer 4: score 4, 2019-08-10 13:59:57
Answer 5: score 2, 2015-06-01 04:05:31
Answer 6: score 0, 2021-08-27 16:04:38
