BeautifulSoup Prettify custom new line option

I'm using BeautifulSoup to build XML files.

It seems my two options are 1) no formatting, i.e. the whole document on one line


or 2) with prettify, i.e. every tag and every piece of text on its own line.


But I would really prefer it to look like this:


I realise I could hack bs4 to achieve this result, but I would like to hear if any options exist.

I'm less bothered about the 4-space indent (although that would be nice) and more bothered about getting a newline after any closing tag and between two opening tags. I'm also intrigued: is there a name for this way of formatting? It seems the most sensible way to me.

You can write a simple html.parser.HTMLParser subclass to achieve what you want:

from bs4 import BeautifulSoup
from html import escape
from html.parser import HTMLParser

data = '''<root><level1><level2><field1>val1</field1><field2>val2</field2><field3>val3</field3></level2></level1></root>'''

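class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()  # HTMLParser must initialise its own state
        self.__t = 0  # current indentation depth
        self.lines = []  # finished output lines
        self.__current_line = ''  # line currently being assembled
        self.__current_tag = ''  # most recently opened tag

    @staticmethod
    def __attr_str(attrs):
        # attributes without a value arrive as (name, None)
        return ' '.join('{}="{}"'.format(name, escape(value or '')) for (name, value) in attrs)

    def handle_starttag(self, tag, attrs):
        # a different tag is opening: flush the pending line first
        if tag != self.__current_tag:
            self.lines += [self.__current_line]

        self.__current_line = '\t' * self.__t + '<{}>'.format(tag + (' ' + self.__attr_str(attrs) if attrs else ''))
        self.__current_tag = tag
        self.__t += 1

    def handle_endtag(self, tag):
        self.__t -= 1
        if tag != self.__current_tag:
            # closing a parent element: the end tag gets a line of its own
            self.lines += [self.__current_line]
            self.lines += ['\t' * self.__t + '</{}>'.format(tag)]
        else:
            # closing the tag that was just opened: keep everything on one line
            self.lines += [self.__current_line + '</{}>'.format(tag)]

        self.__current_line = ''

    def handle_data(self, data):
        self.__current_line += data

    def get_parsed_string(self):
        return '\n'.join(l for l in self.lines if l)

parser = MyHTMLParser()

# note: the 'lxml' parser wraps the fragment in <html><body>
soup = BeautifulSoup(data, 'lxml')
print('BeautifulSoup prettify():')
print('*' * 80)
print(soup.prettify())

print('custom html parser:')
print('*' * 80)
# serialise the soup and run it through the custom parser
parser.feed(str(soup))
parser.close()
print(parser.get_parsed_string())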

Prints:

BeautifulSoup prettify():
custom html parser:
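
If the XML does not have to be serialised by BeautifulSoup itself, one possible alternative is the standard library's xml.etree.ElementTree.indent (Python 3.9 or newer). This is only a sketch, not part of the answer above: indent_xml is a hypothetical helper name, and it assumes the input is a well-formed XML string. It leaves text-only elements on one line and uses a 4-space indent:

```python
import xml.etree.ElementTree as ET

data = ('<root><level1><level2><field1>val1</field1><field2>val2</field2>'
        '<field3>val3</field3></level2></level1></root>')


def indent_xml(xml_text, space='    '):
    """Hypothetical helper: re-indent an XML string.

    Elements that contain only text stay on a single line; elements
    with children get one child per line, indented by `space`.
    """
    root = ET.fromstring(xml_text)
    ET.indent(root, space=space)  # modifies the tree in place (Python 3.9+)
    return ET.tostring(root, encoding='unicode')


pretty = indent_xml(data)
print(pretty)
```

Unlike the HTMLParser approach, this goes through a real XML parser, so tag names keep their original case and malformed input raises an error instead of being silently accepted.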
