
Regex: handling null (when no characters are present between expressions)

I have a regex problem.

My text looks like this:

text='abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'

I want to capture the text of all the hyperlinks. The regex I have written is:

re.findall("<a href=.+?>(.+?)</a>", text, re.DOTALL)

When I run this, it gives me the following output:

['</a></div>abcd<i><a href=">World Bank']

This output occurs because there is no character between the tags in

<a href="></a> 

When I insert any character between those tags, I get the correct output.

From the above text, I need this output:

['World Bank']

How can I modify the regex to get the above output?

Why not use an HTML parser instead?

Example using BeautifulSoup:

In [1]: from bs4 import BeautifulSoup

In [2]: text = 'abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'
In [3]: soup = BeautifulSoup(text, "html.parser")

In [4]: [a.get_text() for a in soup.find_all("a")]
Out[4]: [u'World Bank']
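If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is only a minimal illustration, not part of the original answer: the class name AnchorTextCollector, the helper collect_anchor_text, and the sample string (which assumes well-formed empty href attributes rather than the exact text from the question) are all hypothetical.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collects the non-empty text found inside <a>...</a> elements."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self._buffer = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Start buffering text for this anchor.
            self._in_anchor = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_anchor:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            text = "".join(self._buffer)
            if text:  # skip anchors with no text between the tags
                self.texts.append(text)
            self._in_anchor = False


def collect_anchor_text(html):
    parser = AnchorTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


# Assumed sample: same shape as the question's text, but with closed quotes.
sample = 'abcd<a href=""></a></div>abcd<i><a href="">World Bank</a>'
links = collect_anchor_text(sample)
print(links)
```

Unlike the BeautifulSoup version, this sketch drops empty anchors explicitly, and it does not handle nested anchors.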

As the other answerer mentioned, don't use regex for parsing HTML files. That said, the following regex works for this input:

>>> import re
>>> text='abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'
>>> re.findall(r"(?s)<a href=.+?>([^<>]+)</a>", text)
['World Bank']

[^<>]+ is a negated character class that matches any character except < or >, one or more times. So this captures only World Bank.

Let me explain why findall produces the undesired output.

<a href=.+?>(.+?)</a> 

<a href=.+?> matches the opening anchor tag. (.+?)</a> captures one or more characters non-greedily until a closing a tag is reached. Because .+? must match at least one character, it cannot match the empty content of the first link; it consumes the < of the first </a> and keeps going, so it captures all the characters </a></div>abcd<i><a href=">World Bank up to the next </a>. If you use (.*?) instead, you get two results: an empty string and World Bank.
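The difference between the three capture groups can be checked side by side with a short script. This is a sketch for illustration only; the variable names are made up here, and the last line (filtering empty strings out of the (.*?) result) is an assumed alternative that does not appear in the answers above.

```python
import re

text = 'abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'

# (.+?) needs at least one character, so it runs past the empty first link.
lazy_plus = re.findall(r'<a href=.+?>(.+?)</a>', text, re.DOTALL)

# (.*?) may match nothing, so the empty first link yields ''.
lazy_star = re.findall(r'<a href=.+?>(.*?)</a>', text, re.DOTALL)

# ([^<>]+) needs at least one character and cannot cross a tag boundary.
negated = re.findall(r'<a href=.+?>([^<>]+)</a>', text, re.DOTALL)

# Assumed alternative: keep (.*?) and drop the empty captures afterwards.
non_empty = [s for s in lazy_star if s]

print(lazy_plus)
print(lazy_star)
print(negated)
print(non_empty)
```

One caveat with the [^<>]+ version: on this input the overall match still starts at the first <a href=, because .+? backtracks across the intervening tags until the capture group can succeed. Only the captured group is limited to World Bank.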
