Regex - match if not matched - Python

Question

I apologise for the amount of text, but I cannot wrap my head around this and I would like to make my problem clear.

I am currently attempting to create a regular expression to find the end of a website/email link in order to then process the rest of the address. I have decided to look for the ending of the address (e.g. '.com', '.org', '.net'); however, I am having difficulty in two areas when dealing with this. (I have chosen this method as it is the best fit for the current project.)

Firstly, I am trying to avoid accidentally hindering users who type words that contain these keywords (e.g. '"org"anisation', 'try this "or g"o to'). As an example of how I have tackled this, the regex:

org(?!\w) - To skip the match if there are letters directly after the keyword.
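A minimal sketch of how that negative lookahead behaves in Python's re module. The sample strings and the names pattern, samples and results are illustrative assumptions, not part of the original project. Note that \w covers digits and the underscore as well as letters.

```python
import re

# Negative lookahead: "org" must not be directly followed by a word character
# (\w means a letter, a digit, or an underscore).
pattern = re.compile(r"org(?!\w)")

# Hypothetical sample inputs for illustration only.
samples = ["website.org", "organisation", "info.org.uk"]

# Record whether the pattern is found anywhere in each sample.
results = {s: bool(pattern.search(s)) for s in samples}
print(results)
```

Here "organisation" is rejected because "org" is followed by "a", while "info.org.uk" is accepted because the character after "org" is a dot.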

The second problem is finding extra parts of an address (e.g. 'www.website."org".uk'), which would not be matched. To combat this, as an example, I have used the regex:

org((\W*|\.|dot)\w\w) - In an attempt to find the first two letters after the keyword, as most extensions are only two letters.

The Main Problem:

In order to handle both of the above situations I have used a regex akin to:

org(.|dot)\w\w|(?!\w)

However, I am not as well versed in regex as I would like to be, so I cannot find a solution, and I understand that this would not produce correct results. I know there is a form of 'If this then that' within regex, but I just can't seem to understand the online documentation I have found on the subject.

If possible, would someone be able to explain how I may go about creating a system to say:

IF: NOT org(\w) ELSE IF: org(.|dot) THEN: MATCH org(.|dot)\w\w ELSE: MATCH org

I would really appreciate any help on the matter; this has been on my mind for a while now. I would just like to see it through, but I do not possess the required knowledge.

Edit:

Test cases the regex would need to pass (specifically for the 'org' regex in these examples):

(I have marked matches in square brackets '[ ]', and I have marked possible matches to be disregarded with '< >'.)

"Hello, please come and check out my website: www.website.[org]"
"I have just uploaded a new game at games.[org.uk]"
"If you would like quote please email me at email@email.[org.ru]"
"I have just made a new <org>anisation website at website.[org], please get in contact at name.name@email.[org.us]"
"For more info check info.[org] <or g>o to info.[org.uk]"

I hope this gives better insight into what the regex needs to do.

Answer 1

The following regex:

(?i)(?<=\.)org(?:\.[a-z]{2})?\b

should do the work for you.

demo:

https://regex101.com/r/8F9qbQ/2/

explanations:

(?i) makes the match case-insensitive (.ORG or .org)
(?<=\.) requires a . immediately before org, to avoid matches when org is actually part of a word
org matches ORG or org
(?:...)? a non-capturing group that can appear 0 or 1 times
\.[a-z]{2} a dot followed by exactly 2 letters
\b a word boundary constraint
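As a rough check, the regex above can be run against the test cases from the question. This is only a sketch: the markers '[ ]' and '< >' have been stripped from the sentences, and the names pattern, cases and found are assumptions made for the example.

```python
import re

# The regex from this answer, compiled once.
pattern = re.compile(r"(?i)(?<=\.)org(?:\.[a-z]{2})?\b")

# Test sentences from the question (markers removed), mapped to the
# matches the question asks for.
cases = {
    "Hello, please come and check out my website: www.website.org": ["org"],
    "I have just uploaded a new game at games.org.uk": ["org.uk"],
    "If you would like quote please email me at email@email.org.ru": ["org.ru"],
    "I have just made a new organisation website at website.org, "
    "please get in contact at name.name@email.org.us": ["org", "org.us"],
    "For more info check info.org or go to info.org.uk": ["org", "org.uk"],
}

# The pattern has no capturing groups, so findall returns the full matches.
found = {text: pattern.findall(text) for text in cases}

for text, matches in found.items():
    print(matches, "<-", text)
```

Because the lookbehind demands a literal dot before org, neither "organisation" nor "or go" is picked up.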

Answer 2

There are other, simpler ways to catch any website, but assuming that you need exactly the feature IF: NOT org(\w) ELSE IF: org(.|dot) THEN: MATCH org(.|dot)\w\w ELSE: MATCH org, then you can use:

org(?!\w)(\.\w\w)?

It will match: "org.uk" in www.domain.org.uk and "org" in www.domain.org

But it will not match www.domain.orgzz or orgzz

Explanation: The org(?!\w) part will match an org that is not followed by a word character (a letter, digit, or underscore). It will match the org in "org" and in "org." but will not match orgzz.

Then, once we have the org, we try to match an additional (\.\w\w). The ? quantifier makes this group optional: it will match the .uk if it is there, but the overall match does not require it.
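A short sketch of this regex in Python, run on the examples above. The helper first_match and the variable names are hypothetical and introduced only for illustration.

```python
import re

# "org" not followed by a word character, then an optional dot plus
# two word characters.
pattern = re.compile(r"org(?!\w)(\.\w\w)?")

def first_match(text):
    """Return the first full match in text, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

with_country = first_match("www.domain.org.uk")   # 'org.uk'
plain = first_match("www.domain.org")             # 'org'
rejected_url = first_match("www.domain.orgzz")    # None
rejected_word = first_match("orgzz")              # None
truncated = first_match("www.domain.org.com")     # 'org.co'

print(with_country, plain, rejected_url, rejected_word, truncated)
```

The last case shows one limitation worth keeping in mind: nothing anchors the end of the optional group, so a longer suffix such as ".com" is cut to its first two characters.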

Answer 3

I made a little regex that captures a website as long as it starts with 'www.' followed by some characters and then another '.'.

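import re

# Matches any website of the form www.<something>.<something>
matcher = re.compile(r'(www\.\S*\.\S*)')  # raw string keeps the backslashes intact
string = 'the sky is very blue www.harvard.edu.co see nothing else triggers it, www, org'
# Take the first match; the bare 'www' and 'org' later in the string do not match
match = matcher.search(string).group(1)
# output
# 'www.harvard.edu.co'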

Now you can tighten this up as needed to avoid false positives.


Question

3 answers

Answer 1
2, accepted, 2019-07-05 02:36:15

Answer 2
1, 2019-07-05 01:14:23

Answer 3
0, 2019-07-05 02:14:31
