
Make BeautifulSoup handle line breaks as a browser would

I'm using BeautifulSoup (version 4.3.2 with Python 3.4) to convert HTML documents to text. The problem I'm having is that web pages sometimes contain newline characters ("\n") that a browser would not render as a new line, but when BeautifulSoup converts the page to text, it leaves the "\n" in.

Example:

Your browser probably renders the following all on one line (even though there is a newline character in the middle):

This is a paragraph.

And your browser probably renders the following on multiple lines, even though I'm entering it with no newlines:

This is a paragraph.

This is another paragraph.

But when BeautifulSoup converts the same strings to text, the only line breaks it uses are the newline literals, and it always uses them:

from bs4 import BeautifulSoup

doc = "<p>This is a\nparagraph.</p>"
soup = BeautifulSoup(doc)

soup.text
Out[181]: 'This is a\nparagraph.'

doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
soup = BeautifulSoup(doc)

soup.text
Out[187]: 'This is a paragraph.This is another paragraph.'

Does anyone know how to make BeautifulSoup extract text in a more beautiful way (or really just get all the newlines correct)? Are there any other simple ways around the problem?

get_text might be helpful here:

>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
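
One caveat with the separator approach, shown here as a minimal sketch (the sample HTML is made up for illustration): get_text places the separator between every text node, so inline tags such as <b> also end up producing line breaks.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first paragraph contains an inline <b> tag.
doc = "<p>This is <b>bold</b> text.</p><p>This is another paragraph.</p>"
soup = BeautifulSoup(doc, "html.parser")

# The separator is inserted between every text node,
# not only between block-level elements.
text = soup.get_text(separator="\n")
print(repr(text))
# 'This is \nbold\n text.\nThis is another paragraph.'
```

So this works well when paragraphs contain only plain text, and less well once inline markup is involved.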

While I realize this is an old post, I wanted to highlight some behavior in bs4 regarding how text is extracted from tags.

I'm not an HTML expert, but these are the few things I considered while trying to make bs4 print text as a browser would:

The behaviors I'm about to describe apply to tag.get_text() and tag.find_all(text=True, recursive=True) in BeautifulSoup.

1) New lines in the HTML source

Beautiful Soup outputs a newline wherever one appears in the HTML source.

2) Implicit new lines due to block-level elements

Beautiful Soup does not add newlines before and after block elements like 'p' if there are no source newlines around the tag.

3) <br> tags

BeautifulSoup does not print a newline if the source contains a <br> tag and there are no source newlines around the <br> tag.
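
The three behaviors above can be checked with a short sketch. The helper name text_of and the sample strings are illustrative assumptions, and the built-in "html.parser" backend is assumed.

```python
from bs4 import BeautifulSoup


def text_of(html: str) -> str:
    """Hypothetical helper: parse with the built-in parser and extract text."""
    return BeautifulSoup(html, "html.parser").get_text()


# 1) A newline in the source is kept verbatim.
source_newline = text_of("<p>This is a\nparagraph.</p>")

# 2) Adjacent block elements are joined with no line break.
block_elements = text_of("<p>One.</p><p>Two.</p>")

# 3) A <br> tag contributes no text at all.
br_tag = text_of("One.<br>Two.")

print(repr(source_newline))  # 'This is a\nparagraph.'
print(repr(block_elements))  # 'One.Two.'
print(repr(br_tag))          # 'One.Two.'
```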

Solution

Here is a solution that works for many cases. The limiting factors are: 1) the completeness of the list of inline elements, and 2) how CSS/JS might change whether an element is inline or block at runtime in a browser environment.

from typing import Iterator

import bs4
from bs4 import NavigableString, Tag

def get_text(tag: bs4.Tag) -> str:
    # Not an exhaustive list of inline elements.
    _inline_elements = {"a", "span", "em", "strong", "u", "i", "font", "mark", "label", "s", "sub", "sup", "tt", "bdo", "button", "cite", "del", "b"}

    def _get_text(tag: bs4.Tag) -> Iterator[str]:
        for child in tag.children:
            if type(child) is Tag:
                if child.name == "br":
                    # A <br> is exactly one line break, not a block element.
                    yield "\n"
                    continue
                # Block-level tags get a newline before and after their content.
                is_block_element = child.name not in _inline_elements
                if is_block_element:
                    yield "\n"
                yield from _get_text(child)
                if is_block_element:
                    yield "\n"
            elif type(child) is NavigableString:
                # The exact type check skips comments and other special strings.
                yield str(child)

    return "".join(_get_text(tag))
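
The approach above still keeps newlines that appear in the HTML source. As a rough sketch of one way to also collapse source whitespace the way a browser does for normal flow text: the function name browser_like_text and the shortened INLINE set are assumptions for illustration, and <pre> blocks and CSS are not handled.

```python
import re

from bs4 import BeautifulSoup, NavigableString, Tag

# Assumption: a deliberately short, non-exhaustive set of inline elements.
INLINE = {"a", "span", "em", "strong", "b", "i", "u", "sub", "sup"}


def browser_like_text(html: str) -> str:
    """Hypothetical helper: text with block/<br> breaks and collapsed whitespace."""
    soup = BeautifulSoup(html, "html.parser")

    def walk(node):
        for child in node.children:
            if type(child) is NavigableString:
                # Collapse whitespace runs (including source newlines) to one space.
                yield re.sub(r"\s+", " ", child)
            elif type(child) is Tag:
                if child.name == "br":
                    yield "\n"
                elif child.name in INLINE:
                    yield from walk(child)
                else:
                    # Treat everything else as block-level.
                    yield "\n"
                    yield from walk(child)
                    yield "\n"

    text = "".join(walk(soup))
    # Trim each line and drop the empty ones left by adjacent blocks.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


paragraphs = browser_like_text(
    "<p>This is a\nparagraph.</p><p>This is another paragraph.</p>"
)
print(repr(paragraphs))  # 'This is a paragraph.\nThis is another paragraph.'
```

Dropping empty lines means consecutive <br> tags collapse into a single break, which is a simplification compared with real browser rendering.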

I use the following small library to accomplish this:

https://github.com/TeamHG-Memex/html-text

pip install html-text

As simple as:

>>> import html_text
>>> html_text.extract_text('<h1>Hello</h1> world!')
'Hello\n\nworld!'

I would take a look at python-markdownify. It turns HTML into quite readable text in Markdown format.

It is available on PyPI: https://pypi.python.org/pypi/markdownify/0.4.0

and GitHub: https://github.com/matthewwithanm/python-markdownify
