
How do I strip patterns or words from the end of the string backwards?

I have a string like this:

<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>

I would like to strip the first 3 opening tags and the last 3 closing tags from the string. I do not know the tag names in advance.

I can strip the first 3 tags with re.sub(r'<[^<>]+>', '', in_str, 3). How do I strip the closing tags? What should remain is:

<v1>aaa<b>bbb</b>ccc</v1>

I know I could maybe 'do it right', but I do not wish to do XML or HTML parsing for my purpose, which is to help myself visualize the XML representation of some classes.

Instead, I realized that this problem is interesting. It seems I cannot simply search backwards with regex, i.e. right to left, because that seems unsupported:

If you mean, find the right-most match of several (similar to the rfind method of a string) then no, it is not directly supported. You could use re.findall() and choose the last match, but if the matches can overlap this may not give the correct result.

But .rstrip() strips a set of characters rather than whole words, and it does not handle patterns either.

I looked at Strip HTML from strings in Python, but I only wish to strip up to 3 tags.

What approach could be used here? Should I reverse the string (ugly in itself, and more so because of the '<>'s)? Do tokenization (why not parse, then)? Or create static closing tags based on the left-to-right match?

Which strategy should I follow to strip the patterns from the end of the string?

The simplest would be to use old-fashioned string splitting and limit the number of splits:

in_str.split('>', 3)[-1].rsplit('<', 3)[0]

Demo:

>>> in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
>>> in_str.split('>', 3)[-1].rsplit('<', 3)[0]
'<v1>aaa<b>bbb</b>ccc</v1>'

str.split() and str.rsplit() with a limit split the string at most that many times, counting from the start or from the end respectively, and leave the remainder unsplit for you to select.
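As a rough illustration of the limited split described above, the one-liner can be wrapped in a small helper. The function name strip_outer and the parameter n are hypothetical, introduced only for this sketch; it assumes the string starts with n tags and ends with n tags.

```python
def strip_outer(text, n=3):
    """Drop everything up to the n-th '>' and from the n-th-last '<' onward."""
    # split('>', n) cuts at the first n '>' characters only;
    # the last list item is the untouched remainder.
    remainder = text.split('>', n)[-1]
    # rsplit('<', n) cuts at the last n '<' characters only;
    # the first list item is what lies before them.
    return remainder.rsplit('<', n)[0]


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'

# Intermediate view of the limited split: three short pieces plus the rest.
parts = in_str.split('>', 3)
print(parts)

result = strip_outer(in_str)
print(result)
# <v1>aaa<b>bbb</b>ccc</v1>
```

Because only delimiter characters are counted, the helper does not check that the pieces it drops are actually tags.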

You've already got practically the whole solution. re can't search backwards, but you can reverse the string yourself:

import re

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
in_str = re.sub(r'<[^<>]+>', '', in_str, count=3)   # strip the first 3 tags
in_str = in_str[::-1]                               # reverse the string
in_str = re.sub(r'>[^<>]+/<', '', in_str, count=3)  # strip 3 reversed closing tags
in_str = in_str[::-1]                               # restore the original order

print(in_str)
# <v1>aaa<b>bbb</b>ccc</v1>

Note that the regex is itself reversed so that it matches closing tags in the reversed string; the result is then reversed back into the original order.
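A possible alternative that avoids reversing at all, not part of the answer above: anchor the pattern to the end of the string with $ and repeat it exactly three times. This is only a sketch, and it assumes the three tags at each end are adjacent, with no text or whitespace between them.

```python
import re

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'

# ^ pins three consecutive tags to the start of the string.
out = re.sub(r'^(?:<[^<>]+>){3}', '', in_str)
# $ pins three consecutive closing tags to the end of the string.
out = re.sub(r'(?:</[^<>]+>){3}$', '', out)

print(out)
# <v1>aaa<b>bbb</b>ccc</v1>
```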

Of course, as mentioned, this is far easier with a proper parser:

from lxml import etree

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
ix = etree.fromstring(in_str)  # root element <foo>
# Descend three levels: foo -> bar -> k2 -> v1
print(etree.tostring(ix[0][0][0], encoding='unicode'))
# <v1>aaa<b>bbb</b>ccc</v1>
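If lxml is not installed, the same idea can probably be tried with the standard library's xml.etree.ElementTree. This is a sketch under the assumption that the input is well-formed XML with a single root element; it is a swapped-in module, not the one used in the answer above.

```python
import xml.etree.ElementTree as ET

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'

root = ET.fromstring(in_str)   # the <foo> element
inner = root[0][0][0]          # foo -> bar -> k2 -> v1
# encoding='unicode' makes tostring() return a str instead of bytes.
result = ET.tostring(inner, encoding='unicode')

print(result)
# <v1>aaa<b>bbb</b>ccc</v1>
```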

I would look into regular expressions and use a suitable pattern with a regex split:

http://docs.python.org/3/library/re.html?highlight=regex#re.regex.split
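One way the regex-split suggestion might look in practice, as a sketch only: re.split() keeps the delimiters when the pattern contains a capturing group, so the string can be turned into a flat list of tags and text. The slice below assumes the first three and last three non-empty tokens are tags.

```python
import re

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'

# The capturing group makes re.split() return the tags as list items too.
# Empty strings appear between adjacent tags, so they are filtered out.
tokens = [t for t in re.split(r'(<[^<>]+>)', in_str) if t]

# Drop the first three and the last three tokens, then glue the rest together.
result = ''.join(tokens[3:-3])

print(result)
# <v1>aaa<b>bbb</b>ccc</v1>
```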

Sorry, I can't comment, so I will give this as an answer.

in_str.split('>', 3)[-1].rsplit('<', 3)[0] will work for the given example <foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>, but not for <foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo><another>test</another>. You should just be aware of this.

To solve the counterexample I provided, you will have to track the state (or count) of the tags and check that you match the correct pairs.
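A minimal sketch of what such state tracking could look like, using a stack to pair each closing tag with its opener. The names TAG and strip_wrappers are hypothetical, and the sketch assumes plain tags only: no self-closing tags, comments, or '>' inside attribute values. It refuses to strip anything unless the first n opening tags are closed by the last n closing tags.

```python
import re

# Group 1 is '/' for a closing tag, group 2 is the tag name.
TAG = re.compile(r'<(/?)([^<>/\s]+)[^<>]*>')


def strip_wrappers(text, n=3):
    """Remove n enclosing tag pairs, or raise ValueError if they do not wrap text."""
    matches = list(TAG.finditer(text))
    stack = []   # (name, match) for tags that are currently open
    pairs = {}   # start offset of an opening tag -> match of its closing tag
    for m in matches:
        name = m.group(2)
        if m.group(1) != '/':
            stack.append((name, m))
        else:
            if not stack or stack[-1][0] != name:
                raise ValueError('mismatched tag: ' + m.group(0))
            _, opener = stack.pop()
            pairs[opener.start()] = m
    if stack or len(matches) < 2 * n:
        raise ValueError('unclosed tags or fewer than n pairs')

    start, end = 0, len(text)
    for i in range(n):
        opener = matches[i]
        closer = pairs.get(opener.start())  # None if matches[i] is a closing tag
        # The opener must sit at the current left edge, its partner at the right edge.
        if opener.start() != start or closer is None or closer.end() != end:
            raise ValueError('outer tags do not wrap the whole string')
        start, end = opener.end(), closer.start()
    return text[start:end]


good = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
bad = good + '<another>test</another>'

print(strip_wrappers(good))
# <v1>aaa<b>bbb</b>ccc</v1>

try:
    strip_wrappers(bad)
except ValueError as exc:
    print('refused:', exc)
```

Raising an error for the counterexample is one design choice among several; returning the input unchanged would be equally defensible.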

