Regex - handling null (when no characters exist between expressions)

Question

I have a regex problem.

My text looks like this:

text='abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'

I want to capture all the hyperlinks. The regex I have written is given below:

re.findall("<a href=.+?>(.+?)</a>", text, re.DOTALL)

When I run this, it gives me the following output:

['</a></div>abcd<i><a href=">World Bank']

The above output occurs because there is no character between the opening and closing tags in

<a href="></a>

When I insert any character between those tags, I get the correct output.

From the above text, the output I need is

['World Bank']

How can I modify the regex to get the above output?

Answer 1

Why not use an HTML parser instead?

Example using BeautifulSoup:

In [1]: from bs4 import BeautifulSoup

In [2]: text = 'abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'
In [3]: soup = BeautifulSoup(text, "html.parser")

In [4]: [a.get_text() for a in soup.find_all("a")]
Out[4]: [u'World Bank']

Answer 2

As the other answer mentions, don't use regex to parse HTML files. That said, the following regex works for this input:

>>> import re
>>> text='abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'
>>> re.findall(r"(?s)<a href=.+?>([^<>]+)</a>", text)
['World Bank']

[^<>]+ is a negated character class that matches any character except < or >, one or more times. So this captures only World Bank.

Let me explain why findall produces the undesired output.
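As a variation on the pattern above, the tag part can also be restricted with a negated class so that it cannot run past the end of the opening tag. This is only a sketch and not part of the original answer: the `[^>]*` in the tag part and the helper name `link_texts` are assumptions introduced for illustration, and it handles only simple anchors whose text contains no nested tags.

```python
import re

# Sample input from the question; the href values are malformed as given.
text = 'abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'

# [^>]* keeps the attribute part inside a single opening tag.
# [^<>]+ requires link text that is non-empty and contains no tags.
LINK_TEXT = re.compile(r"<a href=[^>]*>([^<>]+)</a>")

def link_texts(html):
    """Return the text of each simple, non-empty <a href=...> element (hypothetical helper)."""
    return LINK_TEXT.findall(html)

print(link_texts(text))  # ['World Bank']
```

Anchors with empty content are skipped rather than returned as empty strings, which is the behaviour the question asks for. For anything beyond small, predictable snippets, an HTML parser remains the safer choice.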

<a href=.+?>(.+?)</a>

<a href=.+?> matches the opening anchor tag. (.+?)</a> captures one or more characters non-greedily until a closing a tag is reached. Because .+? must consume at least one character, it cannot match the empty content of the first link. It takes the < of the first </a> and keeps going, so it matches all the characters </a></div>abcd<i><a href=">World Bank until the next </a>. If you use (.*?) instead, you get two outputs: an empty string and World Bank.
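The behaviour described here can be checked directly by running both quantifiers against the text from the question. The snippet below is a minimal sketch; the variable names are illustrative, and the final step of filtering out empty strings is an added assumption about how one might use the (.*?) result, not something the answer proposes.

```python
import re

# Sample input from the question.
text = 'abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'

# .+? needs at least one character, so it runs past the first, empty link.
lazy_plus = re.findall(r"<a href=.+?>(.+?)</a>", text, re.DOTALL)

# .*? may match nothing, so the empty link yields an empty string.
lazy_star = re.findall(r"<a href=.+?>(.*?)</a>", text, re.DOTALL)

# Assumption: drop the empty captures to keep only links that have text.
non_empty = [s for s in lazy_star if s]

print(lazy_plus)  # ['</a></div>abcd<i><a href=">World Bank']
print(lazy_star)  # ['', 'World Bank']
print(non_empty)  # ['World Bank']
```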

Answer 1
3 2015-10-26 15:01:33

Answer 2
0 Accepted 2015-10-26 15:20:22