
How to parse text from a single line of XML using Python

I have a single line of XML and would like to parse all of its text parts into a list of strings.

text = '<string name="status">Finishing <xliff:g id="number">%d</xliff:g> percent.</string>'

My desired output:

desired_output = ['Finishing', '%d', 'percent.']

I used a regular expression for this simple task.

import re
pattern = re.compile(r'>.+<')
match = re.findall(pattern, text)

match = ['>Finishing <xliff:g id="number">%d</xliff:g> percent.<']

It seems the regular expression failed to produce my desired output: the greedy `.+` matches everything from the first `>` to the last `<`.

I don't know Python well, but I do know that parsing XML with regular expressions is setting yourself up for a world of pain. Try something like this using ElementTree instead (tested in Python 2.7):

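import xml.etree.cElementTree as ElementTree
xml_text = '<string name="status">Finishing <xliff:g id="number">%d</xliff:g> percent.</string>'
# Wrap the fragment so that the xliff namespace prefix is declared.
xml = ElementTree.fromstring('<data xmlns:xliff="foo">' + xml_text + '</data>')
# method='text' drops the tags and keeps only the text content.
print ElementTree.tostring(xml, method='text')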

Output:

>>> Finishing %d percent.

Note that because the XML uses a namespace prefix, a wrapper element declaring that namespace had to be placed around the text. If your actual XML already declares the namespace, the wrapper can be skipped.
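The snippet above is Python 2 and prints one joined string rather than the list asked for in the question. A minimal Python 3 sketch of the same wrapper idea follows, assuming `itertext()` is an acceptable way to collect the pieces; the names `ET`, `root`, and `parts` are illustrative, and `"foo"` is just the placeholder namespace URI from the answer.

```python
import xml.etree.ElementTree as ET

xml_text = '<string name="status">Finishing <xliff:g id="number">%d</xliff:g> percent.</string>'

# Wrap the fragment so the xliff prefix is declared; "foo" is a placeholder URI.
root = ET.fromstring('<data xmlns:xliff="foo">' + xml_text + '</data>')

# itertext() yields every text and tail string in document order;
# strip each piece and drop any that are empty after stripping.
parts = [chunk.strip() for chunk in root.itertext() if chunk.strip()]

print(parts)  # ['Finishing', '%d', 'percent.']
```

Stripping each piece removes the spaces that surround the inner element, which is what the desired output in the question appears to expect.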

Update your regex to this:

pattern = re.compile(r'.*?>(.+?)<')
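As a rough illustration of the non-greedy idea, the sketch below runs a simplified pattern of the same kind against the sample string from the question. The simplified pattern `>(.+?)<` and the final `strip()` step are assumptions added here, not part of the original answer; the captured pieces keep their surrounding spaces, so they are stripped to match the desired output.

```python
import re

text = '<string name="status">Finishing <xliff:g id="number">%d</xliff:g> percent.</string>'

# Non-greedy: capture the shortest run of characters between a '>' and the next '<'.
pattern = re.compile(r'>(.+?)<')

raw = pattern.findall(text)               # ['Finishing ', '%d', ' percent.']
parts = [piece.strip() for piece in raw]  # remove the surrounding spaces

print(parts)  # ['Finishing', '%d', 'percent.']
```

This works for the one-line sample only; it makes no attempt to handle comments, CDATA sections, or a `>` inside an attribute value, which is where a real XML parser is the safer choice.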

If you are parsing XML/HTML, you might consider using BeautifulSoup; it will save you a great deal of time compared with writing more regexes. If you want to learn regex, though, it will be by trial and error.
