
BeautifulSoup Prettify custom new line option

I'm using BeautifulSoup to build XML files.

It seems like my two options are 1) no formatting, i.e.:


or 2) with prettify(), i.e.:


But I would really prefer it to look like this:


I realise I could hack bs4 to achieve this result, but I would like to hear whether any options already exist.

I'm less bothered about the 4-space indent (although that would be nice) and more bothered about the newline after any closing tag or between two opening tags. I'm also curious whether there is a name for this way of formatting, as it seems the most sensible one to me.

You can write a simple html.parser.HTMLParser subclass to achieve what you want:

from bs4 import BeautifulSoup
from html import escape
from html.parser import HTMLParser

data = '''<root><level1><level2><field1>val1</field1><field2>val2</field2><field3>val3</field3></level2></level1></root>'''

class MyHTMLParser(HTMLParser):
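    def __init__(self):
        # HTMLParser sets up its own internal state here, so it must be called first.
        super().__init__()
        self.__t = 0  # current indentation depth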
        self.lines = []
        self.__current_line = ''
        self.__current_tag = ''

    @staticmethod
    def __attr_str(attrs):
        # Render the (name, value) pairs as name="value", escaping the values.
        return ' '.join('{}="{}"'.format(name, escape(value)) for (name, value) in attrs)

    def handle_starttag(self, tag, attrs):
        if tag != self.__current_tag:
            self.lines += [self.__current_line]

        self.__current_line = '\t' * self.__t + '<{}>'.format(tag + (' ' + self.__attr_str(attrs) if attrs else ''))
        self.__current_tag = tag
        self.__t += 1

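    def handle_endtag(self, tag):
        self.__t -= 1
        if tag != self.__current_tag:
            # Closing a parent element: flush pending text, then put the closing tag on its own line.
            self.lines += [self.__current_line]
            self.lines += ['\t' * self.__t + '</{}>'.format(tag)]
        else:
            # Closing the element that was just opened: keep <tag>text</tag> on one line.
            self.lines += [self.__current_line + '</{}>'.format(tag)]

        self.__current_line = ''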

    def handle_data(self, data):
        # Drop surrounding whitespace so already-indented input does not leak its old indentation.
        self.__current_line += data.strip()

    def get_parsed_string(self):
        return '\n'.join(l for l in self.lines if l)

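parser = MyHTMLParser()

soup = BeautifulSoup(data, 'lxml')
print('BeautifulSoup prettify():')
print('*' * 80)
print(soup.root.prettify())

print('custom html parser:')
print('*' * 80)
# Feed the unformatted markup of <root> to the custom parser, then print its result.
parser.feed(str(soup.root))
print(parser.get_parsed_string())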


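Prints:

BeautifulSoup prettify():
********************************************************************************
<root>
 <level1>
  <level2>
   <field1>
    val1
   </field1>
   <field2>
    val2
   </field2>
   <field3>
    val3
   </field3>
  </level2>
 </level1>
</root>

custom html parser:
********************************************************************************
<root>
	<level1>
		<level2>
			<field1>val1</field1>
			<field2>val2</field2>
			<field3>val3</field3>
		</level2>
	</level1>
</root>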
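If the markup is well-formed XML, the standard library may already give the layout asked for, without a custom parser. The following is a minimal sketch rather than a drop-in replacement: pretty_xml is a hypothetical helper name, and it assumes the input has no whitespace between tags (existing indentation would be kept as text and produce extra blank lines). It relies on xml.dom.minidom's toprettyxml(), which in current Python 3 versions keeps an element that contains only text on a single line.

```python
from xml.dom import minidom


def pretty_xml(xml_text, indent="    "):
    """Indent well-formed XML, keeping text-only elements on one line.

    Hypothetical helper for illustration; assumes no whitespace between tags.
    """
    document = minidom.parseString(xml_text)
    # Serialize from the root element so no <?xml ...?> declaration is emitted.
    return document.documentElement.toprettyxml(indent=indent)


data = ('<root><level1><level2><field1>val1</field1><field2>val2</field2>'
        '<field3>val3</field3></level2></level1></root>')

print(pretty_xml(data))
```

The indent argument controls the indentation string, so the 4-space indent from the question is the default here and '\t' would reproduce the tab layout of the parser above.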

