Operating On re.findall()

Is there a better way to do this? I'd like to transform each match into a string as it is found, rather than building the whole list first and then transforming each item in it:

aList = regexObj.findall(s.text)  # findall() returns [] when nothing matches, so no None fallback (which map() would reject) is needed

self._menuUrls = map( lambda x: str( 'https://....' + x + '?otherparams=...' ), aList )

Is there a ready-made method that does this in one pass, or do I need to write a separate function or lambda? Is there a more efficient way to approach this?

EDIT: I benchmarked several approaches on a file containing 500k matchable instances and found that a list comprehension over re.findall() is 40-50% faster than a list comprehension over re.finditer() for transforming each match as it is found.

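import re
import time

# `lines` is the text being searched (defined elsewhere)
menuUrls = []

start = time.time()

# Raw string, so the backslashes reach the regex engine unchanged
regex = re.compile(r"javascript:iframeLink\('([^']+)'\);")

# Elapsed seconds are noted per variant (presumably timed one at a time)

#My Original Solution = 0.78200006485
menuUrls = map( lambda x: str('http://...' + x + '?param=...'), regex.findall(str(lines)))

#My Revised Solution = 0.619000196457
menuUrls = [ str('http://...' + x + '?param=...') for x in regex.findall(str(lines)) ]

#Friend's Proposal = 0.802000045776
for m in regex.finditer(str(lines)):
    menuUrls.append(str('http://...' + m.group(1) + '?param=...'))

#Stack Proposal = 0.912000179291
# Note: group(0) is the whole match, not the captured group that findall() returns
menuUrls = [ str('http://...' + x.group(0) + '?param=...') for x in regex.finditer(str(lines)) ]

# De-duplicate; keep the result instead of discarding it
menuUrls = set(menuUrls)

print(time.time() - start)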
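A single time.time() pair around all four variants makes it hard to attribute the cost to any one of them. As a rough sketch only, the same four approaches could be wrapped in functions and timed separately with timeit. The sample text, the function names, and the repeat counts below are assumptions made for illustration, not the original 500k-match file, and the finditer variants use group(1) so that all four produce identical output.

```python
import re
import timeit

# Hypothetical sample input; the original 500k-match file is not available.
SAMPLE = "javascript:iframeLink('menu/item');\n" * 1000
REGEX = re.compile(r"javascript:iframeLink\('([^']+)'\);")
PREFIX, SUFFIX = "http://...", "?param=..."


def via_map(text):
    # map() with a lambda; list() forces evaluation on Python 3
    return list(map(lambda x: PREFIX + x + SUFFIX, REGEX.findall(text)))


def via_findall(text):
    # List comprehension over findall()
    return [PREFIX + x + SUFFIX for x in REGEX.findall(text)]


def via_finditer_loop(text):
    # Explicit loop over finditer() with append()
    urls = []
    for m in REGEX.finditer(text):
        urls.append(PREFIX + m.group(1) + SUFFIX)
    return urls


def via_finditer_comprehension(text):
    # group(1), not group(0), so the result matches the findall() versions
    return [PREFIX + m.group(1) + SUFFIX for m in REGEX.finditer(text)]


def benchmark(text=SAMPLE, repeat=5, number=20):
    """Return {function name: best seconds per call} for each approach."""
    results = {}
    for func in (via_map, via_findall, via_finditer_loop,
                 via_finditer_comprehension):
        timings = timeit.repeat(lambda: func(text), repeat=repeat, number=number)
        results[func.__name__] = min(timings) / number
    return results


if __name__ == "__main__":
    for name, seconds in sorted(benchmark().items(), key=lambda kv: kv[1]):
        print("%-28s %.6f s" % (name, seconds))
```

The absolute numbers will depend on the Python version and the input, so the ranking is worth re-checking on the real data.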

You are looking for re.finditer. Something like:

regex_iter = regexObj.finditer(s.text)
self._menuUrls = ['https://....' + x.group(0) + '?otherparams=...' for x in regex_iter]
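One thing to watch when switching from findall() to finditer(): if the pattern contains a single capture group, findall() returns the captured text, while group(0) on a match object returns the entire match. A small sketch of the difference, using the pattern from the question's benchmark and a made-up input string:

```python
import re

# Made-up input for illustration; the pattern has one capture group.
text = "javascript:iframeLink('a/b'); javascript:iframeLink('c/d');"
regex = re.compile(r"javascript:iframeLink\('([^']+)'\);")

captured = regex.findall(text)                           # captured group only
whole = [m.group(0) for m in regex.finditer(text)]       # entire match
first_group = [m.group(1) for m in regex.finditer(text)] # same as findall()

print(captured)     # ['a/b', 'c/d']
print(whole)        # ["javascript:iframeLink('a/b');", "javascript:iframeLink('c/d');"]
print(first_group)  # ['a/b', 'c/d']
```

So group(0) is only a drop-in replacement for findall() when the pattern has no capture groups; otherwise group(1) is probably what is wanted.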

This is marginal, but a list comprehension will generally be faster than map with a lambda (or with any other function that is not a builtin).

Demonstration:

>>> import re
>>> text = "1 234 6 889 33 5 777 dff hd ae 2  ggre 777 fdf"
>>> pattern = re.compile(r"\d+")
>>> nums = ['<'+ m.group(0) + '>' for m in pattern.finditer(text)]
>>> nums
['<1>', '<234>', '<6>', '<889>', '<33>', '<5>', '<777>', '<2>', '<777>']
>>>
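If the goal is to transform each match as it is found, without ever holding an intermediate list, finditer() can also feed a generator. The sketch below is one possible shape for that; iter_menu_urls and its prefix and suffix parameters are hypothetical names chosen for illustration, not part of the re module.

```python
import re


def iter_menu_urls(text, regex, prefix, suffix):
    """Yield one transformed string per match, built as each match is found."""
    for m in regex.finditer(text):
        # Use the first capture group if the pattern defines one,
        # otherwise fall back to the whole match.
        part = m.group(1) if regex.groups else m.group(0)
        yield prefix + part + suffix


pattern = re.compile(r"\d+")
urls = iter_menu_urls("1 234 6", pattern, "<", ">")

first = next(urls)  # only the first match has been searched for so far
rest = list(urls)   # consumes the remaining matches

print(first)  # <1>
print(rest)   # ['<234>', '<6>']
```

This trades speed for memory: the benchmark in the question suggests findall() with a list comprehension is quicker when the full list is needed anyway.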

In the benchmark from the question, the list comprehension over regex.findall() was the fastest of the suggested search-and-transform approaches.
