
Regex - If not match then match this - Python

I apologise for the amount of text, but I cannot wrap my head around this and I would like to make my problem clear.

I am currently trying to write a regular expression that finds the end of a website/email address so that I can then process the rest of the address. I have decided to look for the ending of the address (e.g. '.com', '.org', '.net'); however, I am having difficulty in two areas. (I have chosen this method because it is the best fit for the current project.)

Firstly, I am trying to avoid accidentally catching users who type words that contain these keywords (e.g. '"org"anisation', 'try this "or g"o to'). As an example, I have tackled this with the regex:

org(?!\w) - To skip the match if there are letters directly after the keyword.

The secondary problem is finding extra parts of an address (eg. 'www.website."org".uk') which would not be matched. To combat this, as an example, I have used the regex:

org((\W*|\.|dot)\w\w) - In an attempt to find the first two letters after the keyword, as most extensions are only two letters.

The Main Problem:

In order to handle both of the above situations I have used a regex akin to:

org(.|dot)\w\w|(?!\w)

However, I am not as well versed in regex as I would like to be, and I understand that this would not produce correct results. I know there is a form of 'if this then that' within regex, but I just can't seem to understand the online documentation I have found on the subject.

If possible would someone be able to explain how I may go about creating a system to say:

IF: NOT org(\w) ELSE IF: org(.|dot) THEN: MATCH org(.|dot)\w\w ELSE: MATCH org

I would really appreciate any help on the matter, this has been on my mind for a while now. I would just like to see it through, but I just do not possess the required knowledge.

Edit:

Test cases the Regex would need to pass (Specifically for the 'org' regex for these examples):

(I have marked matches in square brackets '[ ]', and I have marked possible matches to be disregarded with '< >' )

"Hello, please come and check out my website: www.website.[org]"
"I have just uploaded a new game at games.[org.uk]"
"If you would like quote please email me at email@email.[org.ru]"
"I have just made a new <org>anisation website at website.[org], please get in contact at name.name@email.[org.us]"
"For more info check info.[org] <or g>o to info.[org.uk]"

I hope this gives a better insight into what the regex needs to do.

The following regex:

(?i)(?<=\.)org(?:\.[a-z]{2})?\b

should do the work for you.

demo:

https://regex101.com/r/8F9qbQ/2/

explanations:

  • (?i) makes the match case-insensitive ( .ORG or .org )
  • (?<=\.) is a lookbehind that requires a literal . immediately before org, to avoid matches when org is actually part of a word
  • org matches ORG or org
  • (?:...)? is a non-capturing group that can appear 0 or 1 times
  • \.[a-z]{2} is a dot followed by exactly 2 letters (upper or lower case, because of (?i))
  • \b is a word boundary constraint
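
As a quick check, the pattern can be run over the test sentences from the question with Python's re module. This is only a sketch: PATTERN and find_endings are names made up for the illustration, and the expected results in the comments are the bracketed matches the question asks for.

```python
import re

# The pattern from this answer. PATTERN and find_endings are
# hypothetical names chosen for this illustration.
PATTERN = re.compile(r"(?i)(?<=\.)org(?:\.[a-z]{2})?\b")


def find_endings(text):
    """Return every address ending matched in text, in order."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return PATTERN.findall(text)


tests = [
    "Hello, please come and check out my website: www.website.org",
    "I have just uploaded a new game at games.org.uk",
    "If you would like quote please email me at email@email.org.ru",
    "I have just made a new organisation website at website.org, "
    "please get in contact at name.name@email.org.us",
    "For more info check info.org or go to info.org.uk",
]

for sentence in tests:
    print(find_endings(sentence))
# ['org']
# ['org.uk']
# ['org.ru']
# ['org', 'org.us']
# ['org', 'org.uk']
```

Neither "organisation" nor "or go" is matched, because in both cases there is no dot directly in front of the letters.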

There are simpler ways to catch any website, but assuming you need exactly the behaviour IF: NOT org(\w) ELSE IF: org(.|dot) THEN: MATCH org(.|dot)\w\w ELSE: MATCH org, you can use:

org(?!\w)(\.\w\w)?

It will match "org.uk" in www.domain.org.uk and "org" in www.domain.org.

It will not match anything in www.domain.orgzz or orgzz.

Explanation: The org(?!\w) part matches an org that is not followed by a word character (a letter, digit or underscore). It matches the org in org and in org. but not in orgzz.

Then, once org has matched, (\.\w\w)? tries to match an additional dot and two word characters. The ? quantifier makes the group optional, so it will match .uk when it is present but does not require it.
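
A minimal Python sketch of how this pattern behaves on the examples above. ORG_END and matches are names invented for the illustration. Because the pattern contains a capturing group, the sketch uses finditer and group(0) to get the whole match rather than findall, which would return only the group's contents.

```python
import re

# The pattern from this answer; ORG_END and matches are hypothetical names.
ORG_END = re.compile(r"org(?!\w)(\.\w\w)?")


def matches(text):
    """Return the full text of every match found in text."""
    return [m.group(0) for m in ORG_END.finditer(text)]


print(matches("www.domain.org.uk"))  # ['org.uk']
print(matches("www.domain.org"))     # ['org']
print(matches("www.domain.orgzz"))   # []
print(matches("orgzz"))              # []
print(matches("organisation"))       # []

# This pattern does not require a dot in front of "org",
# so a standalone word "org" is matched as well.
print(matches("the org is here"))    # ['org']
```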

I made a little regex that captures a website as long as it starts with 'www.' and is followed by some characters and then another '.'.

import re

# Raw string so the backslashes reach the regex engine unchanged.
# Matches any website laid out as www.<something>.<something>
matcher = re.compile(r'(www\.\S*\.\S*)')
string = 'they sky is very blue www.harvard.edu.co see nothing else triggers it, www, org'
match = matcher.search(string).group(1)
print(match)
# output:
# www.harvard.edu.co

Now you can tighten this up as needed to avoid false positives.
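
One possible way to tighten it, as a sketch only: \S also matches punctuation, so a trailing comma or full stop ends up inside the match. The variant below restricts each label to letters, digits and hyphens and requires a final label of at least two letters. The names LOOSE, TIGHT and find_sites, and the choice of allowed characters, are assumptions for this illustration, not a complete hostname validator.

```python
import re

# LOOSE is the pattern from this answer; TIGHT is a hypothetical, stricter
# variant that only allows letters, digits and hyphens in each label.
LOOSE = re.compile(r"www\.\S*\.\S*")
TIGHT = re.compile(r"\bwww(?:\.[A-Za-z0-9-]+)+\.[A-Za-z]{2,}\b")


def find_sites(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]


text = "visit www.harvard.edu.co, or www.example.org. Not these: www, org"

print(find_sites(LOOSE, text))
# ['www.harvard.edu.co,', 'www.example.org.']
print(find_sites(TIGHT, text))
# ['www.harvard.edu.co', 'www.example.org']
```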
