Finding a string in a text using regular expressions with Python
I have a text in which only <b> and </b> tags have been used, for example <b>abcd efg-123</b>. How can I extract the string between these tags? I also need to extract the 3 words before and after this <b>abcd efg-123</b> chunk. How can I do that? What would be a suitable regular expression for this?
This will get what's in between the tags:
>>> s = "1 2 3<b>abcd efg-123</b>one two three"
>>> for i in s.split("</b>"):
...     if "<b>" in i:
...         print(i.split("<b>")[-1])
...
abcd efg-123
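Since the question asks for a regular expression, the same extraction can also be sketched with `re.findall`. This is only a minimal sketch: it assumes the `<b>` tags are never nested and reuses the sample string `s` from above.

```python
import re

s = "1 2 3<b>abcd efg-123</b>one two three"

# The non-greedy (.*?) stops at the first closing tag, so each
# <b>...</b> pair is returned separately. re.DOTALL lets the
# content span line breaks.
matches = re.findall(r"<b>(.*?)</b>", s, flags=re.DOTALL)
print(matches)  # ['abcd efg-123']
```

`re.findall` returns a list, so a text with several `<b>` chunks yields one entry per chunk.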
This is actually a very dumb version and doesn't allow nested tags.
re.search(r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)", text)
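To see what the seven groups of that pattern hold, here is a short sketch. The sample `text` is made up for illustration. Note that the pattern requires whitespace between the tags and the neighbouring words, so a string such as `1 2 3<b>abcd efg-123</b>one two three` does not match it.

```python
import re

pattern = re.compile(
    r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)"
)

# Hypothetical sample with whitespace on both sides of the tags.
text = "one two three <b>abcd efg-123</b> four five six"

m = pattern.search(text)
if m:
    before = m.groups()[:3]   # the 3 words before the tag
    inside = m.group(4)       # the text between <b> and </b>
    after = m.groups()[4:]    # the 3 words after the tag
    print(before, inside, after)

# No whitespace next to the tags, so search() finds nothing.
no_match = pattern.search("1 2 3<b>abcd efg-123</b>one two three")
print(no_match)  # None
```

`re.search` returns `None` when nothing matches, so the result should be checked before calling `group()`.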
Handles tags inside the <b>, unless they are <b> of course.
import re
sometext = 'blah blah 1 2 3<b>abcd efg-123</b>word word2 word3 blah blah'
# The inner group is repeated with * so that other tags inside <b>...</b> are consumed too.
result = re.findall(
    r'(((?:(?:^|\s)+\w+){3}\s*)'                  # Match 3 words before
    r'<b>((?:[^<]|<[^/]|</[^b]|</b[^>])*)</b>'    # Match <b>...</b>
    r'(\s*(?:\w+(?:\s+|$)){3}))', sometext)       # Match 3 words after
assert result == [(' 1 2 3<b>abcd efg-123</b>word word2 word3 ',
                   ' 1 2 3',
                   'abcd efg-123',
                   'word word2 word3 ')]
This should work and perform well, but if it gets any more advanced than this, you should consider using an HTML parser.
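As a rough illustration of the parser route, here is a sketch built on `html.parser.HTMLParser` from the Python 3 standard library. The class `BoldContext` and the function `bold_with_context` are hypothetical names made up for this example, and "word" is assumed to mean a run of `\w` characters.

```python
import re
from html.parser import HTMLParser


class BoldContext(HTMLParser):
    """Split the input into alternating plain and bold text runs."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of currently open <b> tags
        self.runs = []   # [is_bold, text] pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        is_bold = self.depth > 0
        if self.runs and self.runs[-1][0] == is_bold:
            self.runs[-1][1] += data          # extend the current run
        else:
            self.runs.append([is_bold, data])


def bold_with_context(text, n=3):
    """Return (words_before, bold_text, words_after) for every bold run."""
    parser = BoldContext()
    parser.feed(text)
    parser.close()
    runs = parser.runs
    results = []
    for i, (is_bold, chunk) in enumerate(runs):
        if not is_bold:
            continue
        # Runs alternate, so the neighbours of a bold run are plain text.
        before = runs[i - 1][1] if i > 0 else ""
        after = runs[i + 1][1] if i + 1 < len(runs) else ""
        results.append((re.findall(r"\w+", before)[-n:],
                        chunk,
                        re.findall(r"\w+", after)[:n]))
    return results


sample = 'blah blah 1 2 3<b>abcd efg-123</b>word word2 word3 blah blah'
print(bold_with_context(sample))
# [(['1', '2', '3'], 'abcd efg-123', ['word', 'word2', 'word3'])]
```

Because the parser tracks how many `<b>` tags are open, other tags inside the bold chunk, and even nested `<b>` tags, do not need special handling in a pattern.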
You should not use regexes for HTML parsing. That way madness lies.
The above-linked article actually provides a regex for your problem -- but don't use it.