Trimming part of an HTML text with Python

Question

I have a very long HTML text with the following structure:

<div>
    <div>
        <p>Paragraph 1 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 2 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 3 Lorem ipsum dolor... long text... </p>
    </div>
</div>

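Now suppose I want to trim the HTML text to only 1000 characters, but I still want the HTML to be valid, that is, any tag whose closing tag was cut off should still be closed. How can I repair the trimmed HTML text using Python? Note that the HTML will not always have the structure shown above.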

I need this for an email campaign that sends a preview of a blog post, where the recipient has to visit the blog's URL to read the full article.

Answer 1

How about Beautiful Soup? (python-bs4)

from bs4 import BeautifulSoup

test_html = """<div>
    <div>
        <p>Paragraph 1 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 2 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 3 Lorem ipsum dolor... long text... </p>
    </div>
</div>"""

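# Cut the markup in the middle of the first paragraph,
# which leaves both <div> tags and the <p> tag unclosed.
test_html = test_html[0:50]
soup = BeautifulSoup(test_html, 'html.parser')

# Print the parsed tree; the open tags are closed in the output.
print(soup.prettify())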

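Beautiful Soup closes the tags left open by the cut when it parses the fragment, so .prettify() should print well-formed markup.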
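A character-based slice can land in the middle of a tag or an entity, and it counts markup toward the limit. As a rough alternative, the sketch below uses only the standard library's `html.parser` to copy markup through until a given number of text characters has been emitted, then closes whatever is still open. The names `TextLimitTruncator` and `truncate_html` are hypothetical, and counting text characters (including whitespace between tags) rather than raw HTML characters is an assumption about what the preview needs. Comments and doctype declarations are dropped.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextLimitTruncator(HTMLParser):
    """Copy markup through until `limit` text characters have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.out = []
        self.open_tags = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it, nothing to track.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close everything opened since the matching start tag.
        while self.open_tags:
            opened = self.open_tags.pop()
            self.out.append(f"</{opened}>")
            if opened == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        piece = data[: self.remaining]
        self.out.append(escape(piece, quote=False))
        self.remaining -= len(piece)
        if self.remaining <= 0:
            self.done = True

    def result(self):
        # Close whatever is still open, innermost first.
        closers = "".join(f"</{tag}>" for tag in reversed(self.open_tags))
        return "".join(self.out) + closers


def truncate_html(html_text, limit):
    parser = TextLimitTruncator(limit)
    parser.feed(html_text)
    parser.close()
    return parser.result()


print(truncate_html("<div><p>Hello world</p><p>Second</p></div>", 5))
# prints: <div><p>Hello</p></div>
```

For the email preview in the question, `truncate_html(blog_html, 1000)` would be the corresponding call, with `blog_html` standing in for the full post.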

Answer 2

I can give an example. Suppose the HTML looks like this, with one element per line:

 <div>
 <p>Long text...</p>
 <p>Longer text to be trimmed</p>
 </div>

and you have Python code like this:

def TrimHTML(HtmlString):
    """Keep the first lines, drop the line after them, keep the rest."""
    result = []
    newlinesremaining = 2  # or some other value of your choice
    foundlastpart = False
    for x in HtmlString:  # HtmlString is the HTML to be trimmed
        if newlinesremaining >= 1:
            # Still in a part to keep: copy the character through.
            if x == '\n':
                newlinesremaining -= 1
            result.append(x)
        elif not foundlastpart:
            # Skipping the trimmed line; its newline marks the start of the tail.
            if x == '\n':
                newlinesremaining = float('inf')
                foundlastpart = True
    return ''.join(result)

Then run that code with the example HTML above as the function's input, and the function returns:

 <div>
 <p>Long text...</p>
 </div>

For reasons that may sound odd, I couldn't test this in the short time I had before work.
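The same line-dropping idea can be sketched more compactly with `str.splitlines`. The name `drop_line` and its `keep_first` parameter are hypothetical, and the approach only works under the assumption that every element sits on its own line, which the question says is not guaranteed for real input.

```python
def drop_line(html_text, keep_first=2):
    """Return html_text without the line that follows the first `keep_first` lines."""
    lines = html_text.splitlines(keepends=True)
    # Lines before the cut, then everything after the single dropped line.
    return "".join(lines[:keep_first] + lines[keep_first + 1:])


sample = "<div>\n<p>Long text...</p>\n<p>Longer text to be trimmed</p>\n</div>"
print(drop_line(sample))
```

With the four-line sample, the third line is removed and the closing `</div>` is kept, so the result stays balanced only because a whole element was dropped.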

Solution 1
1, accepted, 2015-11-10 17:07:36

Solution 2
0, 2015-11-10 16:39:19
