Python regular expression: extract the text between two values
What regular expression extracts the text between two values?
Input:
<office:annotation office:name="__Annotation__45582_97049284">
</office:annotation>
case 1 there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__45582_97049284"/>
<office:annotation office:name="__Annotation__19324994_2345354">
</office:annotation>
case 2there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__19324994_2345354"/>
Output:
list = [
'case 1 there can be an arbitrary text with any symbols',
'case 2 there can be an arbitrary text with any symbols',
]
It is better to use an XML parser. If you still want a regex solution, try the following:
>>> import re
>>> text = """<office:annotation office:name="__Annotation__45582_97049284">
... </office:annotation>
... case 1 there can be an arbitrary text with any symbols
... <office:annotation-end office:name="__Annotation__45582_97049284"/>
...
... <office:annotation office:name="__Annotation__19324994_2345354">
... </office:annotation>
... case 2there can be an arbitrary text with any symbols
... <office:annotation-end office:name="__Annotation__19324994_2345354"/>"""
>>> m = re.findall(r'<\/office:annotation>\s*(.*)(?=\n<office:annotation-end)', text)
>>> m
['case 1 there can be an arbitrary text with any symbols', 'case 2there can be an arbitrary text with any symbols']
Or
a better regex, which also handles text spanning several lines, would be:
<\/office:annotation>([\w\W\s]*?)(?=\n?<office:annotation-end)
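The lazy pattern above also captures the line break that follows `</office:annotation>`, so each captured group needs stripping. A minimal sketch of how it might be used, assuming the input is held in a variable named `text` (an illustrative name) and simplifying the character class to `[\w\W]`, which already matches any character:

```python
import re

# Sample input from the question; `text` is an illustrative variable name.
text = """<office:annotation office:name="__Annotation__45582_97049284">
</office:annotation>
case 1 there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__45582_97049284"/>

<office:annotation office:name="__Annotation__19324994_2345354">
</office:annotation>
case 2there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__19324994_2345354"/>"""

# [\w\W] matches any character, newlines included, so the lazy group
# can span several lines without the re.DOTALL flag.
pattern = re.compile(r'</office:annotation>([\w\W]*?)(?=\n?<office:annotation-end)')

# Each captured group begins with the newline after the closing tag,
# so surrounding whitespace is stripped from every match.
results = [body.strip() for body in pattern.findall(text)]
print(results)
```

Because the group is lazy, each match stops at the first `<office:annotation-end` that follows, so the two annotations are not merged into one result.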
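Since this is a namespaced XML document, you have to deal with the namespaces when selecting nodes. See the linked answer for details.
Here is how to parse it using lxml and an xpath expression: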
data.xml
<?xml version='1.0' encoding='UTF-8'?>
<document xmlns:office="http://www.example.org/office">
<office:annotation office:name="__Annotation__45582_97049284">
</office:annotation>
case 1 there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__45582_97049284"/>
<office:annotation office:name="__Annotation__19324994_2345354">
</office:annotation>
case 2there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__19324994_2345354"/>
</document>
Parsing:
from lxml import etree

tree = etree.parse('data.xml')
root = tree.getroot()
nsmap = root.nsmap  # prefix -> namespace URI mapping declared on the root
annotations = root.xpath('//office:annotation', namespaces=nsmap)
comments = []
for annotation in annotations:
    # .tail is the text that follows the element's closing tag
    comment = annotation.tail.strip()
    comments.append(comment)
print(comments)
Output:
['case 1 there can be an arbitrary text with any symbols',
'case 2there can be an arbitrary text with any symbols']
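The same tail-based idea can be sketched with the standard library alone, in case lxml is not available. This is only an illustration: it assumes the namespace URI from the sample `data.xml`, and the names `xml_data` and `ns` are chosen here and do not come from the original answer.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample document (XML declaration omitted for brevity).
xml_data = """<document xmlns:office="http://www.example.org/office">
<office:annotation office:name="__Annotation__45582_97049284">
</office:annotation>
case 1 there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__45582_97049284"/>
<office:annotation office:name="__Annotation__19324994_2345354">
</office:annotation>
case 2there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__19324994_2345354"/>
</document>"""

# ElementTree does not expose an nsmap, so the prefix mapping is given explicitly.
ns = {"office": "http://www.example.org/office"}

root = ET.fromstring(xml_data)
comments = []
for annotation in root.findall(".//office:annotation", ns):
    # .tail is the text after the closing tag; it may be None, hence the fallback.
    comments.append((annotation.tail or "").strip())
print(comments)
```

The `office:annotation-end` elements are not selected, because `annotation-end` is a different tag name from `annotation`.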
>>> import re
>>> # text is assumed to hold the input shown in the question
>>> regex = re.compile(r'</.+>\s*(.+)\s*<.+>')
>>> matched = regex.findall(text)
>>> print(matched)
['case 1 there can be an arbitrary text with any symbols', 'case 2there can be an arbitrary text with any symbols']
Edit: There we go. Ah.. these edit points.
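A generic pattern such as `</.+>\s*(.+)\s*<.+>` matches between any closing tag and any following tag, so it may pick up unrelated markup in a larger document. A stricter sketch, assuming every `office:annotation` and its `office:annotation-end` share the same `office:name` value as in the sample input, uses a backreference to pair them (the names `text`, `pattern`, and `by_name` are illustrative):

```python
import re

# Sample input from the question; `text` is an illustrative variable name.
text = """<office:annotation office:name="__Annotation__45582_97049284">
</office:annotation>
case 1 there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__45582_97049284"/>

<office:annotation office:name="__Annotation__19324994_2345354">
</office:annotation>
case 2there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__19324994_2345354"/>"""

pattern = re.compile(
    r'<office:annotation office:name="([^"]+)">\s*</office:annotation>'  # group 1: the name
    r'(.*?)'                                                              # group 2: the body (lazy)
    r'<office:annotation-end office:name="\1"/>',                         # \1: same name again
    re.DOTALL,  # let . match newlines so the body can span several lines
)

# findall returns (name, body) tuples; map each name to its stripped body.
by_name = {name: body.strip() for name, body in pattern.findall(text)}
print(list(by_name.values()))
```

Keying the result by annotation name keeps each extracted text tied to the tag pair it came from.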