Python regex pattern to match text in an XML string

Question

I'm parsing an XML file and need to remove some clutter from the final output.

str = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><chat-message>2018-10</chat-message>'

My attempt at a solution is:

re.sub(r'<(\w|\d|\s){1,}>{1,4}',"",str)

and my desired output is:

2018-10

Currently Python is finding no matches and just returning str. I don't think < or > are special characters, so no escaping should be needed; I tried escaping anyway and it still did not work.

Answer 1

In my opinion you are better off using an XML parser rather than regex. Here is an example using xml.etree.ElementTree:

import xml.etree.ElementTree as ET

xmlstring = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><chat-message>2018-10</chat-message>'
root = ET.fromstring(xmlstring)

print(root.text)
# OUTPUT
# 2018-10
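
If the message ever contains nested tags, root.text only returns the text before the first child element. A minimal sketch of collecting all the text with itertext() follows; the nested <b> element and the variable names are made up for illustration and are not from the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical input with a nested <b> element (not from the question).
nested = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<chat-message>2018-10 <b>hello</b> world</chat-message>'
)
root = ET.fromstring(nested)

# root.text stops at the first child element.
first_chunk = root.text

# itertext() yields every text fragment in the subtree, in document order.
all_text = "".join(root.itertext())

print(repr(first_chunk))
print(all_text)
```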

Answer 2

You can try something simpler:

re.sub(r'<.*?>', '', str)
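
To see why this works where the pattern in the question does not, here is a small sketch run against the full string from Answer 1 (the variable names are my own). The original pattern only allows word characters and whitespace between < and >, so it cannot get past the ?, =, ", ., - and / that appear inside these tags; the non-greedy <.*?> accepts any characters up to the next >.

```python
import re

xml_string = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<chat-message>2018-10</chat-message>'
)

# Pattern from the question: fails on '?', '-', '/' and similar inside the tags,
# so nothing is replaced and the string comes back unchanged.
unchanged = re.sub(r'<(\w|\d|\s){1,}>{1,4}', "", xml_string)

# Non-greedy pattern: matches each tag from '<' up to the nearest '>'.
stripped = re.sub(r'<.*?>', '', xml_string)

print(unchanged == xml_string)
print(stripped)
```

Keep in mind this strips anything that looks like a tag, so it is only a reasonable shortcut for simple input like this one.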

Answer 3

This regex works for the test case in your question:

r"<[\w\D]+>([-\d]+)"

You can test it here:

https://regex101.com/
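
Note that this pattern is meant for capturing rather than for re.sub: the text you want ends up in group 1. A rough sketch of using it with re.search, assuming the full string from Answer 1 (the variable names are mine):

```python
import re

xml_string = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<chat-message>2018-10</chat-message>'
)

# [\w\D]+ matches almost any character, so the engine backtracks to the last
# '>' that is directly followed by digits or hyphens; those land in group 1.
match = re.search(r"<[\w\D]+>([-\d]+)", xml_string)
text = match.group(1) if match else None

print(text)
```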

3 answers

Answer 1: score 4, 2018-12-20 19:20:37

Answer 2: score 1, accepted, 2018-12-20 19:14:41

Answer 3: score 0, 2018-12-20 19:46:59
