
How to match ' or " with ' or " in a regular expression

The following regular expression is used to extract the URL link from a page:

LINK_REGEX = re.compile("<a [^>]*href=['\"]([^'\"]+)['\"][^>]*>")

Question 1: How do I represent the following string? I mismatched the ' and " on purpose.

<a href="http://www.yahoo.com'>

I have tried the following statements, and neither works for me.

>>> page = '<a href="http://www.yahoo.com\'>'
>>> page
'<a href="http://www.yahoo.com\'>'
>>> page = '<a href="http://www.yahoo.com''>'
>>> page
'<a href="http://www.yahoo.com>'

Question 2: As I understand it, LINK_REGEX will by design match the link above, although this is not desirable. How can I modify the regular expression so that it only accepts ' paired with ' or " paired with "?

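For Question 1, your first approach works. The backslash in the echoed output is only how the interpreter displays the string; the string itself contains a plain ' just before the >.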

>>> page = '<a href="http://www.yahoo.com\'>'
>>> len(page)
31
>>> page
'<a href="http://www.yahoo.com\'>'
>>> page[-1]
'>'
>>> page[-2]
"'"
>>> page[-3]
'm'

(I'd post this as a comment if I had the privilege.)

If you're trying to parse HTML, it is highly recommended that you do not use regex. You'll save yourself a lot of hassle and problems if you use an HTML parsing module like BeautifulSoup or lxml.html.
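As a rough illustration of the parser-based approach, here is a minimal sketch that collects href values. It uses the standard library's html.parser rather than BeautifulSoup or lxml.html, purely so that it runs without extra installs. The names LinkCollector and extract_links are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(page):
    # Hypothetical helper: feed the whole page and return the collected links
    collector = LinkCollector()
    collector.feed(page)
    collector.close()
    return collector.links


page = '<a href="http://www.yahoo.com">Yahoo</a> <a class="x" href=\'http://example.com/\'>e</a>'
print(extract_links(page))
```

The parser handles either quote style, so the regex never has to.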

Second, pretty much every time you use a regex, be sure to prefix the pattern string with r (making it a raw string), like so:

LINK_REGEX = re.compile(r"<a [^>]*href=['\"]([^'\"]+)['\"][^>]*>")

This ensures that backslashes reach the regex engine unchanged instead of being interpreted as Python string escapes first.
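A small sketch of why the r prefix matters, using \b (the word boundary in a regex, but a backspace character in an ordinary Python string). The variable names are only for illustration.

```python
import re

text = "cat concat"

# Raw string: the regex engine receives backslash + b, i.e. a word boundary
raw_hits = re.findall(r"\bcat\b", text)

# Ordinary string: Python turns \b into a backspace character before
# the regex engine ever sees it, so nothing in the text matches
plain_hits = re.findall("\bcat\b", text)

print(len(r"\b"), len("\b"))
print(raw_hits, plain_hits)
```

For the quote characters in LINK_REGEX the two spellings happen to behave the same, so the r prefix is a habit that pays off once the pattern contains escapes such as \b, \s or \1.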

If you definitely need to use regex though, 9000's answer will work for you.

['"] will match ' or ".

(['"]).+\1 will match a quoted string with matching quotes. The expression in parens (a capturing group) will match a single or double quote, and \1 will match whatever the first group has matched (this is called a 'backreference').

Note that the quotes are not escaped in any way in the expressions above, to keep them readable. Your regex strings may need to escape at least one kind of quote.
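Applied to the LINK_REGEX from the question, the backreference idea could look like the sketch below. Two choices here are assumptions of this example rather than part of the answer: the URL is matched with [^'"]+ (as in the original LINK_REGEX) instead of .+, and a triple-quoted raw string is used so that neither quote character needs escaping. The helper name extract_links is hypothetical.

```python
import re

# Group 1 captures the opening quote; \1 requires the same character to close.
# Group 2 captures the URL itself.
LINK_REGEX = re.compile(r"""<a\s[^>]*href=(['"])([^'"]+)\1[^>]*>""")


def extract_links(page):
    # Hypothetical helper: return the URL (group 2) of every match
    return [m.group(2) for m in LINK_REGEX.finditer(page)]


print(extract_links('<a href="http://www.yahoo.com">'))   # matching double quotes
print(extract_links("<a href='http://www.yahoo.com'>"))   # matching single quotes
print(extract_links('<a href="http://www.yahoo.com\'>'))  # mismatched quotes
```

Because the quote now occupies group 1, the URL moves to group 2.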

Use two regexes:

<a\s*[^>]*href="([^"]+)"[^>]*>  # double quoted strings
<a\s*[^>]*href='([^']+)'[^>]*>  # single quoted strings

The content of href will then be in the first capturing group (group 1) of whichever regex matches.
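A minimal sketch of running both patterns over a page. The names DOUBLE_QUOTED, SINGLE_QUOTED and extract_links are invented for this example.

```python
import re

# One pattern per quote style; each has a single capturing group for the URL
DOUBLE_QUOTED = re.compile(r'<a\s*[^>]*href="([^"]+)"[^>]*>')
SINGLE_QUOTED = re.compile(r"<a\s*[^>]*href='([^']+)'[^>]*>")


def extract_links(page):
    # Hypothetical helper: findall returns the group-1 text of every match
    links = []
    for pattern in (DOUBLE_QUOTED, SINGLE_QUOTED):
        links.extend(pattern.findall(page))
    return links


page = '<a href="http://a.example/"> <a href=\'http://b.example/\'>'
print(extract_links(page))
print(extract_links('<a href="http://www.yahoo.com\'>'))  # mismatched quotes
```

Results come back grouped by pattern, not in document order. Also, [^"]+ and [^']+ do not stop at >, so on a longer page a mismatched opening quote could still pair with a quote in a later tag.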
