Finding strings in text using regular expressions in Python

Question

I have a text in which only the <b> and </b> tags are used, for example <b>abcd efg-123</b>. How can I extract the string between these tags? I also need to extract the 3 words before and after this abcd efg-123 chunk. How can I do that? What would be a suitable regular expression for this?

Answer 1

This will get what's in between the tags:

>>> s = "1 2 3<b>abcd efg-123</b>one two three"
>>> for i in s.split("</b>"):
...     if "<b>" in i:
...         print(i.split("<b>")[-1])
...
abcd efg-123
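
For comparison, the same extraction can be sketched with a regular expression, which is what the question asked for. This is a minimal sketch, assuming the tags are never nested and carry no attributes; the helper name `extract_bold` is hypothetical and not part of the answer above.

```python
import re

def extract_bold(text):
    """Return the text of every <b>...</b> span (hypothetical helper)."""
    # The non-greedy .*? stops at the first closing tag;
    # re.DOTALL lets a span cross line breaks.
    return re.findall(r"<b>(.*?)</b>", text, flags=re.DOTALL)

s = "1 2 3<b>abcd efg-123</b>one two three"
print(extract_bold(s))
```

Unlike the split-based loop, `re.findall` returns all spans as a list in one call.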

Answer 2

This is actually a very dumb version and doesn't allow nested tags.

re.search(r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)", text)

See the Python documentation.
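
A short sketch of how the match groups from this pattern might be read. Note that the pattern requires whitespace on both sides of the tags, so the sample text here (an assumption, not taken from the question) puts spaces around them; the variable names are illustrative.

```python
import re

pattern = r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)"
text = "blah 1 2 3 <b>abcd efg-123</b> one two three blah"  # assumed sample

m = re.search(pattern, text)
if m:
    before = list(m.group(1, 2, 3))  # the three words before the tag
    chunk = m.group(4)               # the text between <b> and </b>
    after = list(m.group(5, 6, 7))   # the three words after the tag
    print(before, chunk, after)

# Without whitespace around the tags the pattern finds nothing.
no_space = re.search(pattern, "1 2 3<b>abcd efg-123</b>one two three")
```

If the tags can touch the neighbouring words, as in the question's example, changing the `\s+` next to `<b>` and `</b>` to `\s*` is one possible adjustment.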

Answer 3

Handles tags inside the <b>, unless they are <b> of course.

import re
sometext = 'blah blah 1 2 3<b>abcd efg-123</b>word word2 word3 blah blah'
result = re.findall(
      r'(((?:(?:^|\s)+\w+){3}\s*)'                # Match 3 words before
      # Match <b>...</b>; the inner group is repeated so other tags may appear inside
      r'<b>((?:[^<]|<[^/]|</[^b]|</b[^>])*)</b>'
      r'(\s*(?:\w+(?:\s+|$)){3}))', sometext)     # Match 3 words after

# Each match is a tuple: (whole match, words before, tag content, words after)
result == [(' 1 2 3<b>abcd efg-123</b>word word2 word3 ',
    ' 1 2 3',
    'abcd efg-123',
    'word word2 word3 ')]

This should work and perform well, but if it gets any more advanced than this you should consider using an HTML parser.

Answer 4

You should not use regexes for HTML parsing. That way madness lies.

The above-linked article actually provides a regex for your problem, but don't use it.
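
As a rough illustration of the parser route, here is a minimal sketch using the standard library's `html.parser`. The class `BoldCollector` and the function `bold_with_context` are hypothetical names introduced for this example, and the context words are taken only from the plain text directly adjacent to each <b> span.

```python
from html.parser import HTMLParser

class BoldCollector(HTMLParser):
    """Split a document into alternating plain / bold text segments."""

    def __init__(self):
        super().__init__()
        self.depth = 0                   # current <b> nesting depth
        self.segments = [[False, ""]]    # [is_bold, text] pairs

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            if self.depth == 0:
                self.segments.append([True, ""])
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.segments.append([False, ""])

    def handle_data(self, data):
        # Text inside other tags (e.g. <i>) lands in the current segment.
        self.segments[-1][1] += data

def bold_with_context(text, n=3):
    """Return (words_before, bold_text, words_after) for each <b> span."""
    parser = BoldCollector()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    segs = parser.segments
    results = []
    for i, (is_bold, chunk) in enumerate(segs):
        if not is_bold:
            continue
        before = segs[i - 1][1].split()[-n:]
        after = segs[i + 1][1].split()[:n] if i + 1 < len(segs) else []
        results.append((before, chunk, after))
    return results

sometext = 'blah blah 1 2 3<b>abcd efg-123</b>word word2 word3 blah blah'
print(bold_with_context(sometext))
```

Because the parser tracks nesting depth, other tags inside the <b> span do not need special handling in a pattern.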

Answer 1
3 2010-10-20 13:49:04

Answer 2
1 2010-10-20 13:50:58

Answer 3
1 accepted 2010-10-20 14:10:15

Answer 4
0 2010-10-20 13:48:13
