Python regex to extract the text between two values

Question

Which regular expression extracts the text between two values?

Input:

<office:annotation office:name="__Annotation__45582_97049284">
</office:annotation>
    case 1 there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__45582_97049284"/>

<office:annotation office:name="__Annotation__19324994_2345354">
</office:annotation>
    case 2there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__19324994_2345354"/>

Output:

list = [
'case 1 there can be an arbitrary text with any symbols',
'case 2 there can be an arbitrary text with any symbols',
]

Answer 1

It is better to use an XML parser. If you still want a regex solution, try the following:

>>> import re
>>> text = """<office:annotation office:name="__Annotation__45582_97049284">
... </office:annotation>
...     case 1 there can be an arbitrary text with any symbols
... <office:annotation-end office:name="__Annotation__45582_97049284"/>
... 
... <office:annotation office:name="__Annotation__19324994_2345354">
... </office:annotation>
...     case 2there can be an arbitrary text with any symbols
... <office:annotation-end office:name="__Annotation__19324994_2345354"/>"""
>>> m = re.findall(r'</office:annotation>\s*(.*)(?=\n<office:annotation-end)', text)
>>> m
['case 1 there can be an arbitrary text with any symbols', 'case 2there can be an arbitrary text with any symbols']

Alternatively, a better regex, which also handles text spanning several lines, is:

<\/office:annotation>([\w\W\s]*?)(?=\n?<office:annotation-end)
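A minimal sketch of how this second pattern could be applied. The variable names `text`, `pattern` and `extracted` are illustrative and not from the answer, and the redundant `\s` inside the character class is dropped because `[\w\W]` already matches any character. Because the lazy group also captures the leading newline and indentation, each match is stripped afterwards.

```python
import re

# Sample input copied from the question.
text = '''<office:annotation office:name="__Annotation__45582_97049284">
</office:annotation>
    case 1 there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__45582_97049284"/>

<office:annotation office:name="__Annotation__19324994_2345354">
</office:annotation>
    case 2there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__19324994_2345354"/>'''

# [\w\W]*? lazily matches any characters, including newlines,
# up to the next annotation-end tag (checked by the lookahead).
pattern = re.compile(r'</office:annotation>([\w\W]*?)(?=\n?<office:annotation-end)')

# The captured group still contains surrounding whitespace, so strip it.
extracted = [chunk.strip() for chunk in pattern.findall(text)]
print(extracted)
```

Unlike `(.*)` in the first pattern, the lazy `[\w\W]*?` group keeps matching across line breaks, so annotated text that spans several lines would still be captured as one item.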

Answer 2

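Since this is an XML document with namespaces, you have to deal with those namespaces when selecting nodes. See this answer for details.

Here is how to parse it with lxml and an XPath expression: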

data.xml

<?xml version='1.0' encoding='UTF-8'?>
<document xmlns:office="http://www.example.org/office">

    <office:annotation office:name="__Annotation__45582_97049284">
    </office:annotation>
        case 1 there can be an arbitrary text with any symbols
    <office:annotation-end office:name="__Annotation__45582_97049284"/>

    <office:annotation office:name="__Annotation__19324994_2345354">
    </office:annotation>
        case 2there can be an arbitrary text with any symbols
    <office:annotation-end office:name="__Annotation__19324994_2345354"/>

</document>

Parsing

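from lxml import etree

tree = etree.parse('data.xml')
root = tree.getroot()
# Prefix-to-URI mapping declared on the root element, e.g. {'office': '...'}
nsmap = root.nsmap

# Select every office:annotation element (annotation-end is a different tag)
annotations = root.xpath('//office:annotation', namespaces=nsmap)

comments = []
for annotation in annotations:
    # .tail is the text that follows the element's closing tag; it can be None
    comment = (annotation.tail or '').strip()
    comments.append(comment)

print(comments)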

Output:

['case 1 there can be an arbitrary text with any symbols',
 'case 2there can be an arbitrary text with any symbols']
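The same idea can be sketched with only the standard library, assuming lxml is not available. This is not part of the answer: it uses `xml.etree.ElementTree`, spells the namespace out in `{uri}tag` form instead of relying on `nsmap`, and parses an inline copy of data.xml (the names `xml_text` and `OFFICE_NS` are illustrative, and the XML declaration is left out of the inline string for simplicity).

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:office declaration in data.xml.
OFFICE_NS = "http://www.example.org/office"

# Inline copy of data.xml without the XML declaration.
xml_text = """<document xmlns:office="http://www.example.org/office">

    <office:annotation office:name="__Annotation__45582_97049284">
    </office:annotation>
        case 1 there can be an arbitrary text with any symbols
    <office:annotation-end office:name="__Annotation__45582_97049284"/>

    <office:annotation office:name="__Annotation__19324994_2345354">
    </office:annotation>
        case 2there can be an arbitrary text with any symbols
    <office:annotation-end office:name="__Annotation__19324994_2345354"/>

</document>"""

root = ET.fromstring(xml_text)

comments = []
# ElementTree addresses namespaced tags as "{uri}localname".
for annotation in root.iter("{%s}annotation" % OFFICE_NS):
    # The wanted text is the element's tail; guard against a missing tail.
    comments.append((annotation.tail or "").strip())

print(comments)
```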

Answer 3

>>> import re
>>> # text holds the input shown in the question
>>> regex = re.compile(r'</.+>\s*(.+)\s*<.+>')
>>> matched = regex.findall(text)
>>> print(matched)
['case 1 there can be an arbitrary text with any symbols', 'case 2there can be an arbitrary text with any symbols']
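The pattern above accepts any closing tag followed by any other tag. A stricter variant, offered here only as a sketch and not as part of the answer, could require that the `office:name` on the opening annotation equals the one on the matching annotation-end by using a backreference. The names `pattern` and `by_name` are illustrative.

```python
import re

# Sample input copied from the question.
text = '''<office:annotation office:name="__Annotation__45582_97049284">
</office:annotation>
    case 1 there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__45582_97049284"/>

<office:annotation office:name="__Annotation__19324994_2345354">
</office:annotation>
    case 2there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__19324994_2345354"/>'''

pattern = re.compile(
    # Group 1: the annotation name on the opening tag.
    r'<office:annotation office:name="([^"]+)">\s*</office:annotation>'
    # Group 2: the text in between, matched lazily across lines.
    r'(.*?)'
    # \1 requires the end tag to carry the same name as group 1.
    r'<office:annotation-end office:name="\1"/>',
    re.DOTALL,
)

# Map each annotation name to its stripped text.
by_name = {name: body.strip() for name, body in pattern.findall(text)}
print(by_name)
```

Keying the results by annotation name makes it possible to tell which text belongs to which annotation, at the cost of a pattern that is tied to this exact attribute layout.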


3 answers

Answer 1: score 3, accepted, 2014-07-05 12:58:47
Answer 2: score 0, 2014-07-05 13:21:48
Answer 3: score 0, 2014-07-05 13:22:10