
Match everything until optional string (Python regex)

I've been banging my head against this issue, and it seems like I am missing something trivial, so apologies in advance. I have a URL which may or may not contain some query-string parameters. I want to match the entire URL up to this optional part (not inclusive). For example:

import re
myurl = r'http://myAddress.com/index.aspx?cat=ThisPartChanges&pageNum=41'
matchObj = re.match(r'(.*?)(&pageNum=\d+){0,1}', myurl)
print(matchObj.groups())
# >> ('', None)

# Putting the non-greedy ? outside
matchObj = re.match(r'(.*)?(&pageNum=\d+){0,1}', myurl)
print(matchObj.groups())
# >> ('http://myAddress.com/index.aspx?cat=ThisPartChanges&pageNum=41', None)

# The url might also be without the last part, that is
myurl = r'http://myAddress.com/index.aspx?cat=ThisPartChanges'
# I'd like the regex to capture the first part. "ThisPartChanges" might 
# be different every time

What I would like is to get everything up to &pageNum=\d+, not inclusive. That is:

http://myAddress.com/index.aspx?cat=ThisPartChanges

I am only interested in the part before &pageNum and don't care whether it exists or not. I just want to filter it out so that I get the real address up to and including cat=....

I've tried all sorts of non-greedy acrobatics, but the part that fails me is that the second part is optional, so there's nothing to 'anchor' the non-greedy match. Any ideas on how to do this elegantly? Only the first part is important. Non-regex solutions are also welcome.

Thanks!

You may want to take a look at https://docs.python.org/2/library/urlparse.html (in Python 3 this module is urllib.parse).

The order in which parameters are passed may change, for example:

?pageNum=41&cat=ThisPartChanges

I'd recommend avoiding regular expressions for URL parsing and using this module instead. Here's a working example for your problem:

# Python 3: the urlparse module was renamed to urllib.parse
from urllib.parse import urlparse, parse_qs

myurl = 'http://myAddress.com/index.aspx?cat=ThisPartChanges&pageNum=41'

parsed = urlparse(myurl)

print('scheme  :', parsed.scheme)
print('netloc  :', parsed.netloc)
print('path    :', parsed.path)
print('params  :', parsed.params)
print('query   :', parsed.query)
print('fragment:', parsed.fragment)
print('username:', parsed.username)
print('password:', parsed.password)
print('hostname:', parsed.hostname, '(netloc in lower case)')
print('port    :', parsed.port)

# Query string as a dict of lists: {'cat': ['ThisPartChanges'], 'pageNum': ['41']}
print(parse_qs(parsed.query))
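
To actually drop pageNum wherever it appears in the query string, the parsed pieces can be put back together without it. This is only a minimal sketch, assuming Python 3 (urllib.parse); strip_param is a hypothetical helper name made up for this illustration, not something the library provides.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_param(url, name):
    """Return url with every query parameter called `name` removed.

    strip_param is a hypothetical helper written for this illustration.
    """
    parts = urlsplit(url)
    # parse_qsl returns (key, value) pairs in their original order
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    # Rebuild the URL with the filtered query string
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_param('http://myAddress.com/index.aspx?cat=ThisPartChanges&pageNum=41', 'pageNum'))
print(strip_param('http://myAddress.com/index.aspx?pageNum=41&cat=ThisPartChanges', 'pageNum'))
print(strip_param('http://myAddress.com/index.aspx?cat=ThisPartChanges', 'pageNum'))
```

All three calls print http://myAddress.com/index.aspx?cat=ThisPartChanges. One caveat: urlencode re-encodes the remaining values, so a query string with unusual escaping may not come back byte-for-byte identical.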

In your case, this could do:

^[^&]+

More robust:

^[^?]+\?cat=[^&]+

Example:

In [40]: s = 'http://myAddress.com/index.aspx?cat=ThisPartChanges&pageNum=41'

In [41]: re.search(r'^[^&]+', s).group()
Out[41]: 'http://myAddress.com/index.aspx?cat=ThisPartChanges'

In [42]: re.search(r'^[^?]+\?cat=[^&]+', s).group()
Out[42]: 'http://myAddress.com/index.aspx?cat=ThisPartChanges'
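
As for why the attempts in the question return an empty or a full match: when everything after a lazy (.*?) is optional, the lazy group is satisfied with matching nothing, and a greedy (.*) swallows the whole string before the optional group gets a chance. Anchoring the pattern at the end of the string with $ gives the lazy group something to stop at. A minimal sketch, assuming the parameter is named pageNum and, if present, is the last thing in the URL:

```python
import re

# Lazy prefix, then an optional non-capturing "&pageNum=<digits>" suffix.
# The trailing $ forces the lazy group to extend up to the suffix,
# or up to the end of the string when the suffix is absent.
pattern = re.compile(r'(.*?)(?:&pageNum=\d+)?$')

urls = [
    'http://myAddress.com/index.aspx?cat=ThisPartChanges&pageNum=41',
    'http://myAddress.com/index.aspx?cat=ThisPartChanges',
]
results = [pattern.match(u).group(1) for u in urls]
for r in results:
    print(r)
```

Both URLs give http://myAddress.com/index.aspx?cat=ThisPartChanges. Like the other regex answers, this does not cope with pageNum appearing before cat.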
