Finding strings in a text using regular expressions with Python

I have a text in which only the <b> and </b> tags are used, for example <b>abcd efg-123</b>. How can I extract the string between these tags? I also need to extract the 3 words before and after the <b>abcd efg-123</b> chunk. How can I do that? What would be a suitable regular expression for this?

This will get what's between the tags:

>>> s="1 2 3<b>abcd efg-123</b>one two three"
>>> for i in s.split("</b>"):
...   if "<b>" in i:
...      print(i.split("<b>")[-1])
...
abcd efg-123
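
The same extraction can also be written as a single non-greedy regular expression. This is only a minimal sketch, assuming the text contains no nested <b> tags; the helper name bold_chunks is made up for illustration.

```python
import re

def bold_chunks(text):
    """Return the text of every <b>...</b> span, in order of appearance."""
    # .*? is non-greedy, so each match stops at the nearest closing tag;
    # re.DOTALL lets a span run across line breaks.
    return re.findall(r"<b>(.*?)</b>", text, flags=re.DOTALL)

s = "1 2 3<b>abcd efg-123</b>one two three"
print(bold_chunks(s))  # ['abcd efg-123']
```

Unlike the split-based loop, this returns a list, so several <b> chunks in one text come back together.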

This is actually a very dumb version and doesn't allow nested tags:

re.search(r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)", text)

See the Python documentation.
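
Note that the expression above requires whitespace on both sides of the tags, so it would not match a text like "1 2 3<b>abcd efg-123</b>one two three". The sketch below shows one possible way to use it. It assumes \s* next to the tags is acceptable for your data, and both the PATTERN constant and the context_around_bold helper are hypothetical names, not part of the original answer.

```python
import re

# Hypothetical variant of the expression above: \s* (instead of \s+) next to
# the tags, so that words glued to <b> or </b> still match.
PATTERN = re.compile(
    r"(\w+)\s+(\w+)\s+(\w+)\s*"   # three words before the tag
    r"<b>([^<]+)</b>"             # the chunk itself (no inner tags)
    r"\s*(\w+)\s+(\w+)\s+(\w+)"   # three words after the tag
)

def context_around_bold(text):
    """Return (words_before, chunk, words_after), or None if nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    groups = m.groups()
    return list(groups[:3]), groups[3], list(groups[4:])

print(context_around_bold("1 2 3<b>abcd efg-123</b>one two three"))
# (['1', '2', '3'], 'abcd efg-123', ['one', 'two', 'three'])
```

If fewer than three words surround the chunk, the function returns None rather than a partial result.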

Handles tags inside the <b>, unless they are <b> themselves, of course.

import re    
sometext = 'blah blah 1 2 3<b>abcd efg-123</b>word word2 word3 blah blah'
result = re.findall(
      r'(((?:(?:^|\s)+\w+){3}\s*)'                 # Match 3 words before
      r'<b>((?:[^<]|<[^/]|</[^b]|</b[^>])*)</b>'   # Match <b>...</b>; the repeated group lets inner tags through
      r'(\s*(?:\w+(?:\s+|$)){3}))', sometext)      # Match 3 words after

result == [(' 1 2 3<b>abcd efg-123</b>word word2 word3 ',
    ' 1 2 3',
    'abcd efg-123',
    'word word2 word3 ')]

This should work and perform well, but if it gets any more advanced than this, you should consider using an HTML parser.
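
For comparison, here is a rough sketch of the parser route using the standard library's html.parser module. The BoldExtractor class and its attributes are made up for this example, and it assumes you want only the text of each <b> chunk, with any inner tags stripped.

```python
from html.parser import HTMLParser

class BoldExtractor(HTMLParser):
    """Collect the text found inside <b>...</b>, including text of inner tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <b> tags are currently open
        self.chunks = []    # finished chunks, in document order
        self._parts = []    # text pieces of the chunk being read

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost <b> just closed: store the collected text.
                self.chunks.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self._parts.append(data)

parser = BoldExtractor()
parser.feed("blah 1 2 3<b>abcd <i>efg</i>-123</b>word word2 word3")
parser.close()
print(parser.chunks)  # ['abcd efg-123']
```

Because the parser tracks nesting depth, a <b> inside another <b> is folded into the outer chunk instead of ending it early. Getting the three words on either side would still need extra bookkeeping on top of this.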

You should not use regexes for HTML parsing. That way madness lies.

The above-linked article actually provides a regex for your problem, but don't use it.
