
Why is this Python regex negative lookahead not working?

I am trying to collect a set of URLs, using BeautifulSoup, with very specific criteria. The URLs I want to collect must contain /b-\d+ (/b- followed by a series of digits). However, I want to ignore all URLs containing View%20All, even if they also contain /b-\d+. Here is a sample of URLs:

1. http://www.foo.com/bar/b-12312903?sName=View%20All
2. http://www.foo.com/bar/b-832173712873?sName=View%20All
3. http://www.foo.com/bar/b-1208313109283129
4. http://www.foo.com/bar/b-2198123371239489?adCell=W3

Given the above sample, the valid URLs that I want to collect are #3 and #4. I have tried using different negative lookahead regular expressions and they just aren't working for me:

{"href" : re.compile(r"\/b-\d+.+(?!View\%20All)")}
{"href" : re.compile(r"^.+\/b-\d+.+(?!View\%20All$)")}

Can someone tell me what I am doing wrong?

{"href" : re.compile(r"\/b-\d+.+(?!View\%20All)")}
{"href" : re.compile(r"^.+\/b-\d+.+(?!View\%20All$)")}

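Where did it go wrong?

The lookahead (?!View%20All) only asserts that View%20All does not start at the exact position where the preceding pattern, .+, stopped matching. It says nothing about the text that .+ has already consumed.

Because .+ can stop anywhere, the engine can always find a position where the assertion holds, so in effect the lookahead never rejects anything.

To illustrate, let's check what each part of the pattern matches in

http://www.foo.com/bar/b-12312903?sName=View%20All

/b- matches literally.

\d+ matches 12312903.

Now the problem arises:

.+ is free to match however many characters make the negative assertion (?!View%20All) succeed.

Being greedy, .+ first consumes the rest of the string, ?sName=View%20All. The lookahead is then tested at the end of the string, where nothing follows, so it succeeds. Even if .+ matched only the ?, the remaining text would be sName=View%20All, which begins with s rather than View%20All, so the assertion would succeed there as well. Either way, lines 1 and 2 match.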
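The behaviour described above can be checked with Python's re module. This is only a small sketch: the urls list is the sample from the question, and the variable names (urls, broken) are made up for illustration.

```python
import re

# Sample URLs from the question.
urls = [
    "http://www.foo.com/bar/b-12312903?sName=View%20All",
    "http://www.foo.com/bar/b-832173712873?sName=View%20All",
    "http://www.foo.com/bar/b-1208313109283129",
    "http://www.foo.com/bar/b-2198123371239489?adCell=W3",
]

# First pattern from the question. The backslashes before / and %
# are unnecessary in Python but harmless.
broken = re.compile(r"\/b-\d+.+(?!View\%20All)")

for url in urls:
    m = broken.search(url)
    # Show what was matched and where the match ended, i.e. the
    # position at which the lookahead was finally satisfied.
    print(url, "->", (m.group(0), m.end()) if m else None)
```

For the first URL the match runs all the way to the end of the string, which is why the lookahead has nothing left to reject.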


How to fix it?

When using lookaround assertions, pin down the position at which the check starts.

For example, use a regex like

(\/b-\d+)(\?|$)(?!sName=View\%20All)

which matches only URLs 3 and 4, as shown at

http://regex101.com/r/aS5yS2/1

Here the \? or $ fixes the position from which the negative assertion is checked: right after the ? that starts the query string, or at the end of the URL.
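As a rough check of the anchored lookahead, the sketch below filters the sample URLs from the question with plain re. The names urls, fixed and wanted are illustrative, and the last URL in the usage note is a hypothetical one invented to show a limitation.

```python
import re

# Sample URLs from the question.
urls = [
    "http://www.foo.com/bar/b-12312903?sName=View%20All",
    "http://www.foo.com/bar/b-832173712873?sName=View%20All",
    "http://www.foo.com/bar/b-1208313109283129",
    "http://www.foo.com/bar/b-2198123371239489?adCell=W3",
]

# The ? or end-of-string pins the spot where the lookahead is tested.
fixed = re.compile(r"(/b-\d+)(\?|$)(?!sName=View%20All)")

# Keep only the URLs the pattern finds a match in.
wanted = [u for u in urls if fixed.search(u)]
print(wanted)

# Hypothetical URL (not from the question): sName is not the first
# query parameter, so the lookahead is tested at "x=1..." and passes.
other = "http://www.foo.com/bar/b-1?x=1&sName=View%20All"
print(bool(fixed.search(other)))
```

With BeautifulSoup the compiled pattern would presumably be passed the same way as in the question, i.e. {"href" : fixed}. Note that this pattern only rejects URLs where sName=View%20All comes immediately after the ?, so it relies on the URLs looking like the samples.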

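Another option is to anchor the pattern and check at every remaining character that View%20All does not start there:

^.*?/b-\d+(?:(?!View%20All).)*$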


Or, usually faster, since the lookahead runs only at each V instead of at every character:

^.+?/b-\d+(?:[^V]+|V(?!iew%20All))*$
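To compare the two anchored patterns, here is a minimal sketch that runs both against the sample URLs from the question. The names tempered, unrolled and results are only labels chosen for this example, and the extra URL is hypothetical.

```python
import re

# Sample URLs from the question.
urls = [
    "http://www.foo.com/bar/b-12312903?sName=View%20All",
    "http://www.foo.com/bar/b-832173712873?sName=View%20All",
    "http://www.foo.com/bar/b-1208313109283129",
    "http://www.foo.com/bar/b-2198123371239489?adCell=W3",
]

# Lookahead tested before every character after /b-<digits>.
tempered = re.compile(r"^.*?/b-\d+(?:(?!View%20All).)*$")
# Lookahead tested only after a literal V.
unrolled = re.compile(r"^.+?/b-\d+(?:[^V]+|V(?!iew%20All))*$")

results = {}
for name, pattern in (("tempered", tempered), ("unrolled", unrolled)):
    # Keep the URLs in which the pattern finds a match.
    results[name] = [u for u in urls if pattern.search(u)]
    print(name, results[name])

# Hypothetical URL (not from the question) where View%20All is not
# the first query parameter; both patterns still reject it.
other = "http://www.foo.com/bar/b-1?x=1&sName=View%20All"
print(bool(tempered.search(other)), bool(unrolled.search(other)))
```

Because both patterns scan everything after /b-\d+ up to the end of the string, they reject View%20All wherever it appears in the rest of the URL, not only directly after the ?.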
