How to strip a pattern or words backwards from the end of a string?

Question

I have a string like this:

<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>

I would like to strip the first 3 opening tags and the last 3 closing tags from the string. I do not know the tag names in advance.

I can strip the first 3 tags with re.sub(r'<[^<>]+>', '', in_str, 3). How do I strip the closing tags? What should remain is:

<v1>aaa<b>bbb</b>ccc</v1>

I know I could 'do it right', but I do not want to do XML or HTML parsing for my purpose, which is to help me visualize the XML representation of some classes.

Instead, I realized that this problem is interesting. It seems I cannot simply search backwards with a regex, i.e. right to left, because that seems to be unsupported:

If you mean, find the right-most match of several (similar to the rfind method of a string) then no, it is not directly supported. You could use re.findall() and choose the last match, but if the matches can overlap this may not give the correct result.

But .rstrip is not good with words, and won't do patterns either.
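A small sketch of why .rstrip does not help with whole words: its argument is treated as a set of characters, not as a suffix. The sample string 'proof</foo>' is made up for illustration.

```python
s = 'proof</foo>'

# rstrip removes any trailing characters found in the set {'<', '/', 'f', 'o', '>'},
# so it also eats the 'oof' that belongs to the text itself.
stripped = s.rstrip('</foo>')
print(stripped)  # pr

# Removing a literal suffix needs an explicit check instead.
suffix = '</foo>'
trimmed = s[:-len(suffix)] if s.endswith(suffix) else s
print(trimmed)  # proof
```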

I looked at Strip HTML from strings in Python, but I only wish to strip up to 3 tags.

What approach could be used here? Should I reverse the string (ugly in itself, and more so because of the '<>'s)? Do tokenization (why not parse, then)? Or create static closing tags based on the left-to-right match?

Which strategy should I follow to strip the patterns from the end of the string?

Answer 1

The simplest would be to use old-fashioned string splitting and limit the split:

in_str.split('>', 3)[-1].rsplit('<', 3)[0]

Demo:

>>> in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
>>> in_str.split('>', 3)[-1].rsplit('<', 3)[0]
'<v1>aaa<b>bbb</b>ccc</v1>'

str.split() and str.rsplit() with a limit split the string at most that many times, counting from the start or from the end respectively, so you can pick out the unsplit remainder.
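For reuse, the one-liner can be wrapped in a small function. This is only a sketch: the name strip_outer_tags and the parameter n are made up here, and it assumes the string starts with n tags and ends with n tags, with no '>' or '<' characters outside of tags in those regions.

```python
def strip_outer_tags(in_str, n=3):
    """Drop everything up to the n-th '>' and everything from the n-th-last '<'."""
    # split('>', n) cuts at the first n '>' characters; the last piece is the remainder.
    remainder = in_str.split('>', n)[-1]
    # rsplit('<', n) cuts at the last n '<' characters; the first piece is what is kept.
    return remainder.rsplit('<', n)[0]


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_outer_tags(in_str))  # <v1>aaa<b>bbb</b>ccc</v1>
```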

Answer 2

You've already got practically the whole solution. re can't search backwards, but you can:

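import re

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
# Strip the first 3 tags.
in_str = re.sub(r'<[^<>]+>', '', in_str, count=3)
# Reverse the string so that the closing tags come first.
in_str = in_str[::-1]
print(in_str)
# Strip the first 3 (reversed) closing tags, then reverse back.
in_str = re.sub(r'>[^<>]+/<', '', in_str, count=3)
in_str = in_str[::-1]

print(in_str)
# <v1>aaa<b>bbb</b>ccc</v1>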

Note the reversed regex for the reversed string; afterwards the string is reversed again to restore the original order.
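A different technique that avoids reversing altogether, sketched here under the assumption that the three tags to remove sit directly at the end of the string with nothing between them: anchor a repeated tag pattern with $. The function name strip_tags_both_ends is made up for this example.

```python
import re

TAG = r'<[^<>]+>'


def strip_tags_both_ends(in_str, n=3):
    # Remove n consecutive tags at the very start of the string.
    out = re.sub(r'^(?:%s){%d}' % (TAG, n), '', in_str)
    # Remove n consecutive tags at the very end of the string.
    return re.sub(r'(?:%s){%d}$' % (TAG, n), '', out)


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_tags_both_ends(in_str))  # <v1>aaa<b>bbb</b>ccc</v1>
```

Unlike the reversed-string version, this leaves the string untouched when fewer than n adjacent tags are found at either end.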

Of course, as mentioned, this is way easier with a proper parser:

from lxml import etree

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
root = etree.fromstring(in_str)
# root is <foo>; descend through <bar> and <k2> to reach <v1>.
print(etree.tostring(root[0][0][0], encoding='unicode'))
# <v1>aaa<b>bbb</b>ccc</v1>

Answer 3

I would look into regular expressions and use such a pattern with a split:

http://docs.python.org/3/library/re.html?highlight=regex#re.regex.split
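One way this suggestion could look, as a sketch rather than the answerer's own code: re.split with a capturing group keeps the tags in the result, so the string becomes a list of tag and text tokens. The name strip_with_split is made up, and the code assumes the first n and last n tokens are the wrapper tags.

```python
import re


def strip_with_split(in_str, n=3):
    # A capturing group makes re.split keep the delimiters (the tags) in the output;
    # empty strings between adjacent tags are filtered out.
    tokens = [t for t in re.split(r'(<[^<>]+>)', in_str) if t]
    # Assumption: the first n and the last n tokens are the tags to drop.
    return ''.join(tokens[n:len(tokens) - n])


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_with_split(in_str))  # <v1>aaa<b>bbb</b>ccc</v1>
```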

Answer 4

Sorry, I can't comment, so I will give this as an answer.

in_str.split('>', 3)[-1].rsplit('<', 3)[0] will work for the given example <foo><bar><k2><v1>aaabbbccc</v1></k2></bar></foo>, but not for <foo><bar><k2><v1>aaabbbccc</v1></k2></bar></foo><another>test</another>. You should just be aware of this.

To handle the counterexample I provided, you will have to track the state (or count) of tags and check that you match the correct pairs.
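A rough sketch of such tracking, not a full parser: it counts nesting depth and only removes a layer when the first opening tag is closed by a same-named tag at the very end of the string. The names TOKEN and strip_wrappers are made up, and the code assumes there are no self-closing tags, comments or CDATA sections.

```python
import re

# Group 1 is '/' for a closing tag, group 2 is the tag name.
TOKEN = re.compile(r'<(/?)([^<>/\s]+)[^<>]*>')


def strip_wrappers(in_str, n=3):
    """Remove up to n enclosing element layers, checking that the tags pair up."""
    for _ in range(n):
        first = TOKEN.match(in_str)
        if not first or first.group(1):
            break  # the string does not start with an opening tag
        depth = 0
        closing = None
        for tok in TOKEN.finditer(in_str):
            depth += -1 if tok.group(1) else 1
            if depth == 0:
                closing = tok  # this tag balances the first opening tag
                break
        # Unwrap only if that closing tag ends the string and has the same name.
        if (closing is None or closing.end() != len(in_str)
                or closing.group(2) != first.group(2)):
            break
        in_str = in_str[first.end():closing.start()]
    return in_str


good = '<foo><bar><k2><v1>aaabbbccc</v1></k2></bar></foo>'
bad = good + '<another>test</another>'
print(strip_wrappers(good))         # <v1>aaabbbccc</v1>
print(strip_wrappers(bad) == bad)   # True: nothing is stripped
```

Leaving the counterexample unchanged is a design choice of this sketch; raising an error would be an equally reasonable reaction to input that is not wrapped in a single element.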

Answer 1
3 accepted 2014-03-18 12:10:18

Answer 2
2 2014-03-18 12:26:31

Answer 3
1 2014-03-18 12:14:54

Answer 4
1 2014-03-18 14:21:23