
Itertools to speed up nested loops in Beautiful Soup

This Python 3 code works, but with four nested loops (and more inside them) it runs very slowly.

How can I use itertools to speed up the loops a bit?

For 25 rows with 4 columns of data this takes around 20 seconds.

import bs4 as bs
import urllib.request
import time

start_time = time.time()

# One list per table column: names, cities, streets, street numbers.
a = []
b = []
c = []
d = []

for z in range(1, 10):
    # Download and parse each page.
    source = urllib.request.urlopen(f'https://X.com/id={z}').read()
    soup = bs.BeautifulSoup(source, 'html.parser')

    # Collect each column of the table into its own list, row by row.
    for i in range(0, 50):
        for name in soup.find_all('span', id=f"tblRightHolders:{i}:cellRHSurnameName"):
            a.insert(i, name.string)
        for city in soup.find_all('span', id=f"tblRightHolders:{i}:cellRHPlace"):
            b.insert(i, city.string)
        for street in soup.find_all('span', id=f"tblRightHolders:{i}:cellRHStreet"):
            c.insert(i, street.string)
        for number in soup.find_all('span', id=f"tblRightHolders:{i}:cellRHNumber"):
            d.insert(i, number.string)

# Zip the four column lists into rows and print "name - city - street - number".
X = [list(e) for e in zip(a, b, c, d)]
for nested in X:
    print(" - ".join(map(str, nested)))

print("--- %s seconds ---" % (time.time() - start_time))

The data gets output like this:

Name/Surname - City - Street - Street number

I do not think itertools will speed this up; it only offers a nicer, more readable way to write the loops. If you want to speed it up, there are several options:

  1. Use joblib for parallelization
  2. Try a just-in-time compiler such as numba, but you would probably have to rewrite the code, because the BeautifulSoup calls are most likely not compatible with numba
  3. Rewrite the critical code in C/C++, Rust, or Cython

Those last two are most likely overkill. Go with simple parallelism using joblib if you can (i.e., if you have multiple cores available). Itertools won't speed things up; it can only make your code nicer.
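A minimal sketch of what the joblib approach could look like, reusing the placeholder URL and span ids from your code; the fetch_page helper and the n_jobs value are my own assumptions, not part of your original code:

import bs4 as bs
import urllib.request
from joblib import Parallel, delayed

def fetch_page(z):
    # Download and parse a single page (same placeholder URL as in the question).
    source = urllib.request.urlopen(f'https://X.com/id={z}').read()
    soup = bs.BeautifulSoup(source, 'html.parser')

    rows = []
    for i in range(0, 50):
        # Look up the four cells of row i; skip the row if any cell is missing.
        cells = [soup.find('span', id=f"tblRightHolders:{i}:cell{col}")
                 for col in ("RHSurnameName", "RHPlace", "RHStreet", "RHNumber")]
        if any(cell is None for cell in cells):
            continue
        rows.append([cell.string for cell in cells])
    return rows

# One job per page; n_jobs=-1 uses every available core.
pages = Parallel(n_jobs=-1)(delayed(fetch_page)(z) for z in range(1, 10))

# Flatten the per-page results and print them in the original "a - b - c - d" format.
for rows in pages:
    for row in rows:
        print(" - ".join(map(str, row)))

Each page is handled independently, so parallelizing over the page ids is the natural split; the per-row work stays exactly as in your code.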

[edit] I do recommend timing your code first. If most of the time is spent downloading the pages, you can still use joblib, but with threads instead of processes. Just today I was doing something similar with 100 separate threads for webpage requests.
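If the downloads really are the bottleneck, the only change to the sketch above is the backend; joblib can run the same jobs in threads instead of processes (the n_jobs value is again just an assumption):

# prefer="threads" swaps the process pool for a thread pool,
# which is enough when the work is I/O-bound rather than CPU-bound.
pages = Parallel(n_jobs=9, prefer="threads")(
    delayed(fetch_page)(z) for z in range(1, 10)
)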
