
Python regex pattern to match text inside xml string

I'm parsing an XML file and need to remove some clutter from the final output.

str = <?xml version="1.0" encoding="UTF-8" standalone="yes"?><chat-message>2018-10

My attempt at a solution is:

re.sub(r'<(\w|\d|\s){1,}>{1,4}',"",str)

and my desired output is:

2018-10

Currently Python is finding no matches and just returning str. I don't think < or > are special characters, so no escaping should be needed; I tried escaping them anyway and it still did not work.

In my opinion you are better off using an XML parser rather than regex. Here is an example using xml.etree.ElementTree:

import xml.etree.ElementTree as ET

xmlstring = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><chat-message>2018-10</chat-message>'
root = ET.fromstring(xmlstring)

print(root.text)
# OUTPUT
# 2018-10
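If the element ever contains child elements, root.text only returns the text before the first child. A minimal sketch of collecting all the text with itertext() follows; the nested <date> and <body> elements are hypothetical and not part of the question's data:

```python
import xml.etree.ElementTree as ET

# Hypothetical input: <chat-message> with child elements (not from the question).
xmlstring = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<chat-message><date>2018-10</date><body>hello</body></chat-message>'
)
root = ET.fromstring(xmlstring)

# itertext() walks the element and all its descendants, yielding each text chunk.
text = " ".join(root.itertext())
print(text)
```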

You could try something simpler:

re.sub(r'<.*?>', '', str)
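As a rough check, the sketch below runs both the original pattern and the non-greedy one. It assumes the full string from the ElementTree answer (with the closing tag), and uses xml_text as an illustrative variable name so the built-in str is not shadowed. The original pattern most likely finds nothing because the text between < and > contains characters such as ?, =, ", - and /, none of which \w, \d or \s can match:

```python
import re

# Assumption: the complete string, as given in the ElementTree answer.
xml_text = ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
            '<chat-message>2018-10</chat-message>')

# Original attempt: only word characters, digits and whitespace are allowed
# between "<" and ">", so "<?xml ...?>" and "<chat-message>" never match.
original = re.sub(r'<(\w|\d|\s){1,}>{1,4}', '', xml_text)

# Non-greedy version: "<", then as few characters as possible, then ">".
stripped = re.sub(r'<.*?>', '', xml_text)

print(original == xml_text)  # the original pattern changed nothing
print(stripped)
```

Note that . does not match newlines by default, so a tag that spans several lines would need flags=re.DOTALL.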

This regex works for the test case in your question:

r"<[\w\D]+>([-\d]+)"

You can test it here:

https://regex101.com/
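The pattern above can also be checked locally. This is only a sketch: it assumes the full string from the ElementTree answer and reads the value from the capture group with re.search, since the answer does not say how the pattern is meant to be applied:

```python
import re

# Assumption: the complete string, as given in the ElementTree answer.
xml_text = ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
            '<chat-message>2018-10</chat-message>')

# [\w\D]+ matches essentially any character, so it runs greedily and then
# backtracks to the last ">" that is followed by digits or hyphens.
match = re.search(r"<[\w\D]+>([-\d]+)", xml_text)

# Group 1 holds the digits-and-hyphens run after that ">".
value = match.group(1) if match else None
print(value)
```

Because the pattern is greedy, it captures only the last such run in the string, so it is best suited to input with a single value like the test case.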
