Regex Match on String (DOI)

Hi, I'm struggling to understand why my regex isn't working.

I have URLs that contain DOIs, like so:

https://link.springer.com/10.1007/s00737-021-01116-5
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://onlinelibrary.wiley.com/doi/10.1111/jocn.13435
https://journals.sagepub.com/doi/pdf/10.1177/1062860613484171
https://onlinelibrary.wiley.com/resolve/openurl?genre=article&title=Natural+Resources+Forum&issn=0165-0203&volume=26&date=2002&issue=1&spage=3
https://dx.doi.org/10.1108/14664100110397304?nols=y
https://onlinelibrary.wiley.com/doi/10.1111/jocn.15833
https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true

I'm using, for example, this regex, but it always returns an empty list:

print(re.findall(r'/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i', 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))

Where have I gone wrong?

It looks like you come from another programming language that has regex literals delimited by forward slashes, with modifiers following the closing slash (hence /i).

Python has no such syntax, so these slashes and modifiers are taken as literal characters. For flags like i you can use the optional flags parameter of findall (for example, flags=re.I).
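
To make the difference concrete, here is a minimal sketch. The pattern and sample string are made up for illustration and are not from the question.

```python
import re

# Slashes and the trailing "i" are ordinary characters in a Python pattern,
# so this searches for the literal text "/doi/i" and finds nothing.
literal_slashes = re.findall(r'/doi/i', 'DOI')

# The case-insensitive flag goes in the optional `flags` parameter instead.
with_flag = re.findall(r'doi', 'DOI', flags=re.I)

print(literal_slashes)  # []
print(with_flag)        # ['DOI']
```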

Secondly, ^ matches the start of the input string, but the URLs you have as input do not start with 10, so it has to go. Instead, you could require that the 10 follows a word boundary (\b), i.e. that it is not preceded by an alphanumeric character or an underscore.

Similarly, $ matches the end of the input string, but some of your URLs continue with URL parameters, like ?nols=y, so the part you are interested in does not always run to the end of the input. So that has to go too.
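
The effect of the two anchors can be checked on the URL from the question. This is only a sketch: the variable names are illustrative, and the dot is already written as \. so that it matches only a literal dot.

```python
import re

url = 'https://dx.doi.org/10.1108/02652320410549638?nols=y'

# ^ requires the match to begin at position 0, where the URL has "https".
anchored_start = re.findall(r'^10\.\d{4,9}', url)

# \b only requires a word boundary before "10"; the "/" provides it here.
word_boundary = re.findall(r'\b10\.\d{4,9}', url)

# $ requires the match to run to the end, but "?nols=y" follows the DOI
# and neither "?" nor "=" is in the character class.
anchored_end = re.findall(r'\b10\.\d{4,9}/[-.;()/:\w]+$', url)

print(anchored_start)  # []
print(word_boundary)   # ['10.1108']
print(anchored_end)    # []
```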

The dot has a special meaning in regex (it matches any character), but you clearly intended to match a literal dot, so it should be escaped as \.
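
A short sketch of why the escape matters, using a made-up sample string in which "10x1108" is deliberately not a DOI prefix:

```python
import re

# Hypothetical sample: one false candidate, one real DOI prefix.
sample = '10x1108 10.1108'

# An unescaped dot matches any character, so "10x1108" is accepted too.
unescaped = re.findall(r'10.\d{4}', sample)

# An escaped dot matches only a literal ".".
escaped = re.findall(r'10\.\d{4}', sample)

print(unescaped)  # ['10x1108', '10.1108']
print(escaped)    # ['10.1108']
```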

Finally, alphanumeric characters can be matched with \w, which matches both lowercase and capital Latin letters as well as digits and the underscore, so you can shorten the character class a bit and do without flags such as i (re.I).
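
As a rough check, the explicit class with re.I and the \w shorthand accept the same made-up ASCII sample below. One caveat to be aware of: for str patterns in Python 3, \w also matches non-ASCII letters unless re.ASCII is passed, so the two forms are not strictly identical.

```python
import re

# Hypothetical DOI-like suffix containing letters, digits, "_", "." and "-".
sample = 'jocn.13435_Ab-1'

explicit = re.findall(r'[-._A-Z0-9]+', sample, flags=re.I)
shorthand = re.findall(r'[-.\w]+', sample)

print(explicit)   # ['jocn.13435_Ab-1']
print(shorthand)  # ['jocn.13435_Ab-1']

# \w is Unicode-aware by default; re.ASCII restricts it to [a-zA-Z0-9_].
unicode_word = re.findall(r'\w+', '\u00e9')
ascii_word = re.findall(r'\w+', '\u00e9', flags=re.ASCII)

print(unicode_word)  # ['é']
print(ascii_word)    # []
```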

This leaves us with:

import re

print(re.findall(r'\b10\.\d{4,9}/[-.;()/:\w]+',
                 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
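
To try the pattern against the URLs listed in the question, something like the following sketch could be used. The helper name extract_doi and the constant DOI_PATTERN are illustrative, not part of the original answer, and duplicate URLs from the list are omitted.

```python
import re

# Same pattern as above, compiled once for reuse.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[-.;()/:\w]+')

urls = [
    'https://link.springer.com/10.1007/s00737-021-01116-5',
    'https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228',
    'https://onlinelibrary.wiley.com/doi/10.1111/jocn.13435',
    'https://onlinelibrary.wiley.com/resolve/openurl?genre=article'
    '&title=Natural+Resources+Forum&issn=0165-0203&volume=26'
    '&date=2002&issue=1&spage=3',
    'https://dx.doi.org/10.1108/14664100110397304?nols=y',
    'https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true',
]


def extract_doi(url):
    """Return the first DOI-like substring in url, or None if there is none."""
    match = DOI_PATTERN.search(url)
    return match.group(0) if match else None


dois = [extract_doi(url) for url in urls]
for doi in dois:
    print(doi)
```

The openurl link contains no DOI, so extract_doi returns None for it. Query strings such as ?nols=y are left out of the result because ? is not in the character class.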
