String regex matching (DOI)

Question

Hi, I'm struggling to understand why my regex isn't working.

I have URLs that contain DOIs, like so:

https://link.springer.com/10.1007/s00737-021-01116-5
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://onlinelibrary.wiley.com/doi/10.1111/jocn.13435
https://journals.sagepub.com/doi/pdf/10.1177/1062860613484171
https://onlinelibrary.wiley.com/resolve/openurl?genre=article&title=Natural+Resources+Forum&issn=0165-0203&volume=26&date=2002&issue=1&spage=3
https://dx.doi.org/10.1108/14664100110397304?nols=y
https://onlinelibrary.wiley.com/doi/10.1111/jocn.15833
https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true

And I'm using, for example, this regex, but it always returns an empty list:

print(re.findall(r'/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i', 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))

Where have I gone wrong?

Answer 1

It looks like you come from another programming language that has regex literals delimited by forward slashes, with the modifiers following the closing slash (hence /i).

Python has no such syntax, so these slashes and modifiers are treated as literal characters. For flags like i you can use the optional flags parameter of findall.
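A minimal sketch of that difference, using a made-up sample string (the text and variable names here are illustrative only, not from the question):

```python
import re

text = "DOI: 10.1000/ABC"  # hypothetical sample input

# The slashes and the trailing "i" are ordinary characters in Python, so this
# pattern looks for the literal text "/abc/i", which the sample does not contain.
literal_slashes = re.findall(r"/abc/i", text)

# Case-insensitive matching is requested through the flags argument instead.
with_flag = re.findall(r"abc", text, flags=re.I)

print(literal_slashes)  # []
print(with_flag)        # ['ABC']
```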

Secondly, ^ matches the start of the input string, but the URLs you have as input evidently do not start with 10, so that has to go. Instead you could require that the 10 follow a word break, i.e. that it not be preceded by an alphanumeric character (or underscore).

Similarly, $ matches the end of the input string, but some of your URLs continue with URL parameters, like ?nols=y, so the part you are interested in does not extend to the end of the input. So that has to go too.
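To see the effect of the anchors in isolation, here is a small sketch using one of the URLs from the question. The suffix part is simplified to \d+ purely for illustration, so it is not the final pattern:

```python
import re

url = "https://dx.doi.org/10.1108/14664100110397304?nols=y"

# With ^ and $ the whole string would have to be the DOI, so nothing matches.
anchored = re.findall(r"^10\.\d{4,9}/\d+$", url)

# With \b the match may start after the "/" and simply stops before the "?".
unanchored = re.findall(r"\b10\.\d{4,9}/\d+", url)

print(anchored)    # []
print(unanchored)  # ['10.1108/14664100110397304']
```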

The dot has a special meaning in regex, but you clearly intended to match a literal dot, so it should be escaped.
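A quick illustration of why the escape matters, on a made-up sample string (not from the question):

```python
import re

sample = "10.1108 10x1108"  # hypothetical input: one real prefix, one look-alike

# An unescaped dot matches any character, so the look-alike is accepted too.
unescaped = re.findall(r"10.\d{4}", sample)

# Escaped, only a literal dot is accepted.
escaped = re.findall(r"10\.\d{4}", sample)

print(unescaped)  # ['10.1108', '10x1108']
print(escaped)    # ['10.1108']
```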

Finally, alphanumeric characters can be matched with \w, which matches both lowercase and capital Latin letters (as well as digits and the underscore), so you can shorten the character class a bit and do without any flags such as i (re.I).

This leaves us with:

import re

print(re.findall(r'\b10\.\d{4,9}/[-.;()/:\w]+',
                 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
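
As a rough sketch of how the same pattern could be applied to several of the URLs from the question, the snippet below wraps it in a helper. The name extract_doi and the choice to return None when nothing matches are assumptions made for illustration, not part of the original answer:

```python
import re

# Same pattern as above, compiled once for reuse.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-.;()/:\w]+")


def extract_doi(url):
    """Return the first DOI-like substring in url, or None if there is none."""
    match = DOI_PATTERN.search(url)
    return match.group(0) if match else None


urls = [
    "https://link.springer.com/10.1007/s00737-021-01116-5",
    "https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228",
    "https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true",
    "https://onlinelibrary.wiley.com/resolve/openurl?genre=article&issn=0165-0203",
]

for url in urls:
    print(extract_doi(url))
```

URLs that carry no DOI at all, such as the openurl link, yield None here, so the caller can decide how to handle them.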
