
Parsing a query string using a regular expression in Python

I am trying to parse a URL string using a regular expression. Here is my pattern: qid=(.*?)&+? It does find the query string value, but it fails if there is no & at the end of the URL.

Please take a look at the pythex.org page where I am trying to extract the value of the query string parameter "qid".

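You can (and probably should) solve it with urlparse (part of urllib.parse in Python 3) instead:

>>> # Python 3; on Python 2 use: from urlparse import urlparse, parse_qs
>>> from urllib.parse import urlparse, parse_qs
>>> s = "https://xx.com/question/index?qid=2ss2830AA38Wng"
>>> parse_qs(urlparse(s).query)['qid'][0]
'2ss2830AA38Wng'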
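Indexing with ['qid'] raises KeyError when the parameter is absent. A minimal sketch of a wrapper that returns a default instead, assuming Python 3 (where these functions are imported from urllib.parse); the name get_query_param is hypothetical and not part of the standard library:

```python
from urllib.parse import urlparse, parse_qs


def get_query_param(url, name, default=None):
    """Return the first value of query parameter `name`, or `default` if absent."""
    # parse_qs maps each parameter name to a list of its values.
    params = parse_qs(urlparse(url).query)
    values = params.get(name)
    return values[0] if values else default


print(get_query_param("https://xx.com/question/index?qid=2ss2830AA38Wng", "qid"))
print(get_query_param("https://xx.com/question/index?qid=2ff38Wng&a=aubb", "qid"))
print(get_query_param("https://xx.com/question/index?a=aubb", "qid"))  # no qid here
```

The first two calls print the qid value; the last prints None because the URL has no qid parameter.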

As for the regular expression approach, you can require that the value is followed by either & or the end of the string:

qid=(.*?)(?:&|$)

(?:...) here is a non-capturing group.
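A short sketch of how that pattern could be used with re.search, covering a URL with and without a trailing parameter; the names QID_PATTERN and urls are illustrative only:

```python
import re

# "qid=", a lazily captured value, then either "&" or the end of the string.
QID_PATTERN = re.compile(r"qid=(.*?)(?:&|$)")

urls = [
    "https://xx.com/question/index?qid=2ss2830AA38Wng",
    "https://xx.com/question/index?qid=2ff38Wng&a=aubb&d=ajfbjhcbha",
]

for url in urls:
    match = QID_PATTERN.search(url)
    # group(1) is the captured value; the non-capturing group is not reported.
    print(match.group(1) if match else None)
```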

I agree with @alecxe that this is best handled with urlparse. However, here are some re options. The main trick is using the lookbehind, (?<=...), and lookahead, (?=...), assertions.

The general pattern is: return something with 'qid=' behind it, and zero or one '&' ahead of it: '(?<=qid=) some_pattern (?=&)?'

If the input is not multi-line and you process the URLs individually, this will work for any value of the qid variable: '(?<=qid=)([^&]*)(?=&)?'

However, if you have to handle multi-line input (several URLs in one string), then you also need to avoid matching the newline characters. Let's assume the newline is '\n' (but of course, different platforms use different newline conventions). Then you could use: '(?<=qid=)([^&\n]*)(?=&)?'
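One way to sidestep the newline question is to split the text into lines first and apply the single-line pattern to each one. This is only a sketch, assuming the URLs arrive as one newline-separated string; str.splitlines() recognizes '\n', '\r\n' and '\r', and the variable names are illustrative:

```python
import re

# The single-line pattern from above, unchanged.
QID_PATTERN = re.compile(r"(?<=qid=)([^&]*)(?=&)?")

# Two URLs with deliberately mixed line endings.
text = (
    "https://xx.com/question/index?qid=2ss2830AA38Wng\r\n"
    "https://xx.com/question/index?qid=2ff38Wng&a=aubb&d=ajfbjhcbha\n"
)

qids = []
for line in text.splitlines():  # line terminators are stripped here
    match = QID_PATTERN.search(line)
    if match:
        qids.append(match.group(1))

print(qids)
```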

And lastly, if you are sure your qid variable will only store alphanumeric values, you could avoid the uncertainty about the newline character and just match alphanumeric values: '(?<=qid=)([A-Za-z0-9]*)(?=&)?'

import re

# Single line version
s_1 = 'https://xx.com/question/index?qid=2ss2830AA38Wng'
s_2 = 'https://xx.com/question/index?qid=2ff38Wng&a=aubb&d=ajfbjhcbha'
q_1 = r'(?<=qid=)([^&]*)(?=&)?'

print('Single line version')
print(re.findall(q_1, s_1))
print(re.findall(q_1, s_2))

# Multiline version V1: also stop at a newline
s_m = s_1 + '\n' + s_2
q_m = r'(?<=qid=)([^&\n]*)(?=&)?'

print('\nMultiline version V1')
print(re.findall(q_m, s_m))

# Multiline version V2: match alphanumeric characters only
q_m_2 = r'(?<=qid=)([A-Za-z0-9]*)(?=&)?'

print('\nMultiline version V2')
print(re.findall(q_m_2, s_m))

Running this prints:

Single line version
['2ss2830AA38Wng']
['2ff38Wng']

Multiline version V1
['2ss2830AA38Wng', '2ff38Wng']

Multiline version V2
['2ss2830AA38Wng', '2ff38Wng']
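Two caveats about the lookaround patterns are worth checking. Because the lookahead is optional, (?=&)? never constrains the match; the character class [^&]* is what stops at the next &. Also, (?<=qid=) matches after any text ending in "qid=", including a different parameter such as "uqid=". The sketch below illustrates both points and tries a stricter pattern that requires ? or & before the name; the sample URL and the stricter pattern are assumptions made for illustration, not part of the answers above:

```python
import re

# Hypothetical URL with a second parameter whose name merely ends in "qid".
url = "https://xx.com/question/index?uqid=other&qid=2ff38Wng&a=aubb"

with_lookahead = re.findall(r"(?<=qid=)([^&]*)(?=&)?", url)
without_lookahead = re.findall(r"(?<=qid=)([^&]*)", url)

# The optional lookahead adds no constraint, so both lists are identical.
print(with_lookahead == without_lookahead)

# Both also pick up the value of "uqid".
print(with_lookahead)

# Requiring "?" or "&" directly before the name skips "uqid".
strict = re.findall(r"[?&]qid=([^&#\s]*)", url)
print(strict)
```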
