Regex: handling an empty match (when no characters are present between expressions)
I have a regex situation.
My text looks like this:
text='abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'
I want to capture the text of all the hyperlinks. The regex I have written is given below:
re.findall("<a href=.+?>(.+?)</a>", text, re.DOTALL)
When I run this, it gives me this output:
['</a></div>abcd<i><a href=">World Bank']
This output occurs because there is no character between the tags in
<a href="></a>
When I insert any character between those tags, I get the correct output.
From the above text I need this output:
['World Bank']
How can I modify the regex to get the above output?
Why not use an HTML parser instead?
Example using BeautifulSoup:
In [1]: from bs4 import BeautifulSoup
In [2]: text = 'abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'
In [3]: soup = BeautifulSoup(text, "html.parser")
In [4]: [a.get_text() for a in soup.find_all("a")]
Out[4]: [u'World Bank']
As the other answerer mentioned, don't use regex for parsing HTML files. If you still want a regex for this input:
>>> import re
>>> text='abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'
>>> re.findall(r"(?s)<a href=.+?>([^<>]+)</a>", text)
['World Bank']
[^<>]+ is a negated character class that matches any character except < or >, one or more times. So this captures World Bank only.
Let me explain why findall produces the undesired output with the original pattern:
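To make the effect of the negated character class concrete, here is a minimal sketch that runs the pattern from this answer on the question's text. The second pattern, using [^>]* for the opening tag, is not from the original answer; it is a hypothetical stricter variant added for comparison, and the variable names are illustrative only.

```python
import re

text = 'abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'

# Pattern from the answer: the capture group cannot contain < or >,
# so it can never swallow a tag.
answer_matches = re.findall(r'(?s)<a href=.+?>([^<>]+)</a>', text)

# Hypothetical stricter variant (an assumption, not part of the answer):
# [^>]* stops the opening tag at its own closing '>'.
strict_matches = re.findall(r'<a\b[^>]*>([^<>]+)</a>', text)

print(answer_matches)  # ['World Bank']
print(strict_matches)  # ['World Bank']
```

Both patterns skip the empty anchor because [^<>]+ needs at least one character that is not part of a tag. Neither will capture link text that contains nested tags, which is one more reason to prefer a parser.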
<a href=.+?>(.+?)</a>
<a href=.+?> matches the opening anchor tag. (.+?)</a> captures one or more characters non-greedily until a closing a tag is reached. Because .+? must consume at least one character, the capture cannot be empty: for the empty anchor it starts at the < of the first </a> and keeps going. So it matches all the characters </a></div>abcd<i><a href=">World Bank until the next </a>. If you use (.*?) instead, you get two outputs: an empty string and World Bank.
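The difference between (.+?) and (.*?) can be checked directly. The sketch below reuses the text from the question; the variable names and the final filtering step are assumptions added for illustration, not part of the original answer.

```python
import re

text = 'abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'

# (.+?) needs at least one character, so the empty anchor is skipped over
# and the capture runs on to the next </a>.
plus_matches = re.findall(r'<a href=.+?>(.+?)</a>', text, re.DOTALL)

# (.*?) may match nothing, so the empty anchor yields an empty string.
star_matches = re.findall(r'<a href=.+?>(.*?)</a>', text, re.DOTALL)

# Dropping the empty strings leaves only the anchors that have text.
non_empty = [m for m in star_matches if m]

print(plus_matches)  # ['</a></div>abcd<i><a href=">World Bank']
print(star_matches)  # ['', 'World Bank']
print(non_empty)     # ['World Bank']
```

Filtering after the match is a simple way to reach the output the question asks for while keeping the original pattern almost unchanged.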