
Python: how to strip whitespace from XML text nodes

I have an XML file as follows:

<Person>
<name>

 My Name

</name>
<Address>My Address</Address>
</Person>

The <name> tag has extra newlines. Is there a quick, Pythonic way to trim them and generate a new XML file?

I found this, but it only trims the whitespace between tags, not the whitespace inside the text values: https://skyl.org/log/post/skyl/2010/04/remove-insignificant-whitespace-from-xml-string-with-python/

Update 1 - Also handle the following XML, which has tail whitespace inside the <name> tag:

<Person>
<name>

 My Name<shortname>My</shortname>

</name>
<Address>My Address</Address>
</Person>

The accepted answer handles both kinds of XML shown above.

Update 2 - I have posted my own version in an answer below. I use it to remove all kinds of whitespace and write pretty-printed XML to a file with an XML encoding declaration:

https://stackoverflow.com/a/19396130/973699

With lxml you can iterate over all elements and strip() the text of each one that has any:

from lxml import etree

tree = etree.parse('xmlfile')
root = tree.getroot()

for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()

print(etree.tostring(root))

It yields:

<Person><name>My Name</name>
<Address>My Address</Address>
</Person>

UPDATE: to strip tail text too:

from lxml import etree

tree = etree.parse('xmlfile')
root = tree.getroot()

for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()
    if elem.tail is not None:
        elem.tail = elem.tail.strip()

print(etree.tostring(root, encoding="utf-8", xml_declaration=True))
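If lxml is not available, the same text/tail loop seems to carry over to the standard library's xml.etree.ElementTree. The following is only a sketch under that assumption: the function name strip_text_and_tail and the SAMPLE string are made up for illustration, and it parses from a string rather than from 'xmlfile'.

```python
import xml.etree.ElementTree as ET

# Sample input modelled on the "Update 1" XML from the question.
SAMPLE = """<Person>
<name>

 My Name<shortname>My</shortname>

</name>
<Address>My Address</Address>
</Person>"""


def strip_text_and_tail(xml_text):
    """Strip leading/trailing whitespace from every text and tail."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # .text is the text between the start tag and the first child.
        if elem.text is not None:
            elem.text = elem.text.strip()
        # .tail is the text that follows the element's end tag.
        if elem.tail is not None:
            elem.tail = elem.tail.strip()
    return ET.tostring(root, encoding="unicode")


print(strip_text_and_tail(SAMPLE))
```

Be aware that stripping tails in mixed content (text interleaved with child elements) can glue words together, so this is best suited to data-style XML like the example.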

The accepted answer given by Birei, using lxml, does the job perfectly, but I wanted to trim all kinds of whitespace and blank lines and then regenerate pretty-printed XML in an XML file.

The following code did what I wanted:

from lxml import etree

#discard strings which are entirely white spaces
myparser = etree.XMLParser(remove_blank_text=True)

root = etree.parse('xmlfile',myparser)

#from Birei's answer 
for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()
    if elem.tail is not None:
        elem.tail = elem.tail.strip()

#write the xml file with pretty print and xml encoding
root.write('xmlfile', pretty_print=True, encoding="utf-8", xml_declaration=True)
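For comparison, here is a rough standard-library sketch of the same strip-then-pretty-print idea. It assumes Python 3.9 or later, where xml.etree.ElementTree.indent() is available; the function name strip_and_indent is hypothetical, and it returns a string instead of writing to 'xmlfile'.

```python
import xml.etree.ElementTree as ET


def strip_and_indent(xml_text, space="  "):
    """Strip all text/tail whitespace, then re-indent the tree."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.text is not None:
            elem.text = elem.text.strip()
        if elem.tail is not None:
            elem.tail = elem.tail.strip()
    # indent() only inserts whitespace where text/tail is empty or blank,
    # so the stripped values themselves are left untouched.
    ET.indent(root, space=space)
    return ET.tostring(root, encoding="unicode")


messy = "<Person>\n<name>\n\n My Name\n\n</name>\n<Address>My Address</Address>\n</Person>"
print(strip_and_indent(messy))
```

To write a file with an XML declaration instead, ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True) should play the role of the lxml write() call above.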

You have to parse the XML one way or another for this, so maybe use xml.sax and copy each event to the output stream (skipping ignorableWhitespace), adding tag markers as needed. Check the sample code here: http://www.knowthytools.com/2010/03/sax-parsing-with-python.html
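A minimal sketch of that SAX idea, under a few assumptions of mine: it reuses xml.sax.saxutils.XMLGenerator to write the tags (rather than adding tag markers by hand), it buffers character data because the parser may deliver one text run in several chunks, and the names StrippingGenerator and strip_with_sax are hypothetical. A non-validating parser reports most whitespace through characters(), not ignorableWhitespace(), which is why the text is stripped explicitly.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class StrippingGenerator(XMLGenerator):
    """Copy SAX events to the output, stripping each text run."""

    def __init__(self, out):
        super().__init__(out, encoding="utf-8")
        self._chunks = []

    def _emit_text(self):
        # Join buffered chunks, strip them, and write what is left.
        text = "".join(self._chunks).strip()
        self._chunks = []
        if text:
            super().characters(text)

    def characters(self, content):
        # Only buffer here; the text is written at the next tag.
        self._chunks.append(content)

    def ignorableWhitespace(self, whitespace):
        # Skip ignorable whitespace entirely.
        pass

    def startElement(self, name, attrs):
        self._emit_text()
        super().startElement(name, attrs)

    def endElement(self, name):
        self._emit_text()
        super().endElement(name)


def strip_with_sax(xml_text):
    out = io.StringIO()
    xml.sax.parseString(xml_text.encode("utf-8"), StrippingGenerator(out))
    return out.getvalue()


print(strip_with_sax("<Person>\n<name>\n\n My Name\n\n</name>\n</Person>"))
```

The output starts with an XML declaration written by XMLGenerator, followed by the document with its text runs stripped. Namespaced documents would need startElementNS/endElementNS overrides as well, which this sketch leaves out.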

You can use BeautifulSoup. Traverse all elements and, for each one that contains some text, replace it with its stripped version:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('xmlfile', 'r'), 'xml')

for elem in soup.find_all():
    if elem.string is not None:
        elem.string = elem.string.strip()

print(soup)

Assuming xmlfile has the content provided in the question, it yields:

<?xml version="1.0" encoding="utf-8"?>
<Person>
<name>My Name</name>
<Address>My Address</Address>
</Person>

I'm working with an older version of Python (2.3), and I'm currently stuck with the standard library. To show an answer that's broadly backwards compatible, I've written this with xml.dom and xml.minidom functions.

import codecs
from xml.dom import minidom

# Read in the file to a DOM data structure.
original_document = minidom.parse("original_document.xml")

# Open a UTF-8 encoded file, because it's fairly standard for XML.
stripped_file = codecs.open("stripped_document.xml", "w", encoding="utf8")

# Tell minidom to format the child text nodes without any extra whitespace.
original_document.writexml(stripped_file, indent="", addindent="", newl="")

stripped_file.close()

While it's not BeautifulSoup, this solution is pretty elegant and uses the full force of the lower-level API. Note that the actual formatting is just one line :)
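One caveat worth checking on your own Python version: as far as I can tell, in current Python 3 releases writexml() writes the data of existing text nodes unchanged, so whitespace already present in the document may survive the call above. A hedged sketch that strips the text nodes explicitly before serializing follows; the helper names strip_text_nodes and strip_with_minidom are hypothetical, and it works on a string instead of the files used above.

```python
from xml.dom import minidom


def strip_text_nodes(node):
    """Recursively strip text nodes and drop those that become empty."""
    # Iterate over a copy, because children may be removed on the way.
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            child.data = child.data.strip()
            if not child.data:
                node.removeChild(child)
        else:
            strip_text_nodes(child)


def strip_with_minidom(xml_text):
    document = minidom.parseString(xml_text)
    strip_text_nodes(document)
    # Serialize only the root element, without an XML declaration.
    return document.documentElement.toxml()


print(strip_with_minidom("<Person>\n<name>\n\n My Name\n\n</name>\n</Person>"))
```

After strip_text_nodes() has run, document.writexml(stripped_file) can be used as in the answer to write the result to a file.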

