
How can I optimize a web-scraping code snippet to run faster?

I wrote this piece of code and it is currently running, scraping a massive amount of data. So far the loop has run 800 times. It will have to run ~16,000 times to grab all of the data.

Generally speaking, how can I optimize web-scraping code, or am I at the mercy of requests.get?

import json
import requests
import pandas as pd
from pandas.io.json import json_normalize

headers = {}  # actual request headers omitted
p = {}        # actual query params omitted; must include an initial 'page' value for the line below

a = int(p['page'])
df = pd.DataFrame()
while True:
    p['page'] = str(a)
    try:
        a += 1
        r = requests.get('URL',headers=headers, params=p)
        complete_json = r.json()
        print('success')
        df_data = pd.DataFrame.from_dict(json_normalize(complete_json['explore_vintage']['matches']), orient='columns')
        df = df.append(df_data)

    except:
        # a failed request (e.g. page out of range) ends the scrape
        break

df.to_excel('output.xlsx', encoding='utf8')
df.to_csv("output.csv")
print(df.head())

There are a couple of optimizations I can see right off the bat.

The first thing you could add here is parallel processing via async requests. The requests library is synchronous, so, as you are seeing, each call blocks until the page fully processes. There are a number of libraries the requests project officially recommends for going asynchronous, such as requests-futures, which is used below. If you go this route you'll need to define a terminating condition more explicitly rather than relying on a try / except block inside an infinite while loop.

This is all pseudo-code, primarily adapted from the requests-futures examples, but you can see how this might work:

from requests_futures.sessions import FuturesSession
from concurrent.futures import as_completed
import json
import os
import time

# Each response is written straight to disk instead of being parsed
# into a dataframe inside the scraping loop.
def response_hook(resp, *args, **kwargs):
    parsed = resp.json()
    # thread CPU time is used as a crude, unique-ish filename;
    # a uuid or the page number would avoid any chance of collisions
    with open(f'tmp/{time.thread_time_ns()}.json', 'wb') as fp:
        fp.write(json.dumps(parsed).encode('utf-8'))


os.makedirs('tmp', exist_ok=True)

futures_session = FuturesSession()
# Register the hook once at the session level so every request uses it
futures_session.hooks['response'] = response_hook


with futures_session as session:
    # One future per page; the known page count doubles as the terminating condition
    futures = [
        session.get(f'https://jsonplaceholder.typicode.com/todos/{i}')
        for i in range(16000)
    ]
    for future in as_completed(futures):
        resp = future.result()  # re-raises any exception from the request or hook

The parsing of the data into a dataframe is the other obvious bottleneck, and it will keep slowing down as the dataframe grows, because df.append copies the entire frame on every iteration. I don't know the size of these JSON responses, but if you're fetching 16k of them I imagine this would quickly grind to a halt once you've eaten through your memory. If possible, I would recommend decoupling the scraping and transforming operations: save each scraped response into its own independent JSON file (as in the example above). Once the scraping completes, you can loop over the saved files, parse them, and only then output to Excel and CSV; a sketch of that second phase follows. Depending on the size of the JSON files you may still run into memory issues, but you at least won't block the scraping process and can deal with the output processing separately.
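A minimal sketch of that second phase, assuming the hook above wrote one payload per file into tmp/ and that each payload has the same explore_vintage.matches structure as your current code (pd.json_normalize is the maintained replacement for the pandas.io.json import):

import glob
import json

import pandas as pd

frames = []
for path in glob.glob('tmp/*.json'):
    with open(path, encoding='utf-8') as fp:
        payload = json.load(fp)
    # flatten the nested matches into tabular columns
    frames.append(pd.json_normalize(payload['explore_vintage']['matches']))

# Concatenate once at the end instead of appending inside the loop,
# which copies the growing dataframe on every iteration.
df = pd.concat(frames, ignore_index=True)

df.to_excel('output.xlsx', index=False)
df.to_csv('output.csv', index=False)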
