
How to speed up requests module while scraping?

Firstly, I'm not asking this question because I cannot find an answer, but because I cannot understand the answers that I've found.

It's very easy for people to answer a question thinking "I answered your question; if you don't understand it, it's your own fault", so now I need some help in understanding, or just in simplifying, the process.

I have a list of about 300,000 URLs that I am visiting using Python's requests module. The time it takes to get/load each URL is quite painful, which I believe is because of the amount of content located at the URL. I'm probably at 15-20 seconds per request. I'm trying to think of any way that I can greatly reduce this amount of time.

My first thought was whether or not I could somehow disable/filter out images and anything else that I know ahead of time I won't be needing, using requests. I'm not sure how to implement that, or if it can even be done.

My second idea is to send "batch requests," which looks to me like sending multiple requests simultaneously. I'm really not sure if this is actually faster; I haven't been able to test it properly, since I can't get my bit of code to work. My assumption is that I can send X requests in one shot, get X responses in return, and just process each one individually. What I've attempted to use as a solution is below.

import multiprocessing

import requests

def getpage(url):
    # Handles a single URL so that pool.map() can fan the list out across workers.
    r = requests.get(url)
    dostuffwithresponse()

for file in list_files:
    list_links = open(file).readlines()
    pool = multiprocessing.Pool(processes=10)
    # Pass the function itself, not a call to it; map() supplies one URL per worker.
    pool_outputs = pool.map(getpage, list_links)
    pool.close()
    pool.join()
    print('*')
    print(pool_outputs)

Between reducing the size of my responses, if possible, and sending multiple requests at once, my goal is to shorten my 15+ second wait time to 5 seconds or under (or as close to that as I can get).

Does anyone have a good suggestion on a simpler, more direct way to go about this?

@OleksandrDashkov linked to a very helpful guide on sending millions of requests fairly efficiently with aiohttp and asyncio.

I will attempt to condense that information into something that should help you with your problem.

I highly recommend that you look at the asyncio documentation and other blog posts, so you can have a good understanding of what it is before programming with it (or read over the code and try to understand what it's doing).

We'll start with how a basic fetch works in aiohttp. It's quite similar to requests.

import asyncio
import aiohttp

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:  # 'url' is a single URL string
            dostuffwithresponse()  # To mimic your code.

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

# If you're on Python 3.7 :o
asyncio.run(main())

Fairly straightforward. If you used requests' session object, it should be almost identical other than the async syntax.
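For comparison, here's roughly what the same fetch looks like with a requests Session (a minimal sketch; url and dostuffwithresponse() are the same placeholders as above):

import requests

def main():
    with requests.Session() as session:
        response = session.get(url)  # 'url' is the same placeholder as above
        dostuffwithresponse()  # To mimic your code.

main()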

Now, we want to get a lot of URLs. We also don't want to recreate a session object every time.

async def fetch(session, url):
    async with session.get(url) as response:
        dostuffwithresponse()

async def main():
    async with aiohttp.ClientSession() as session:
        for file in list_files:
            for link in open(file).readlines():
                await fetch(session, link.strip())  # strip() removes the trailing newline from readlines()

Now we are fetching all the URLs. But it's still effectively sequential, because we wait for each fetch() to complete before moving on to the next link.

async def fetch(session, url):
    ...

async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        for file in list_files:
            for link in open(file).readlines():
                task = asyncio.ensure_future(fetch(session, link.strip()))
                tasks.append(task)
        results = await asyncio.gather(*tasks)
    # results is a list of everything that returned from fetch().
    # do whatever you need to do with the results of your fetch function

Here I would suggest that you take the time to understand what asyncio.ensure_future() and asyncio.gather() do. Python 3.7 has revamped documentation for them, and there are plenty of blog posts about them.
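As a rough, self-contained illustration (separate from the scraping code): ensure_future() wraps each coroutine in a Task and schedules it on the event loop, and gather() waits for all of them and collects their return values in order.

import asyncio

async def double(x):
    await asyncio.sleep(0.1)  # stands in for some I/O wait
    return x * 2

async def demo():
    # Schedule five coroutines as Tasks; they all start running concurrently.
    tasks = [asyncio.ensure_future(double(n)) for n in range(5)]
    # Wait for every Task and collect the return values in the original order.
    results = await asyncio.gather(*tasks)
    print(results)  # [0, 2, 4, 6, 8]

asyncio.get_event_loop().run_until_complete(demo())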

Lastly, you can't fetch 300,000 links concurrently. Your OS is most likely going to give you errors about opening too many file descriptors, or something to that effect.

So, you would solve this problem by using a semaphore. In this case, that means asyncio.Semaphore(max_size) or asyncio.BoundedSemaphore(max_size).

async def fetch(session, url):
    ...

async def bounded_fetch(sem, session, url):
    async with sem:
        await fetch(session, url)

async def main():
    tasks = []
    sem = asyncio.Semaphore(1000)  # Generally, most OS's don't allow you to make more than 1024 sockets unless you personally fine-tuned your system.
    async with aiohttp.ClientSession() as session:
        for file in list_files:
            for link in open(file).readlines():
                # Notice that I use bounded_fetch() now instead of fetch()
                task = asyncio.ensure_future(bounded_fetch(sem, session, link.strip()))
                tasks.append(task)
        results = await asyncio.gather(*tasks)
    # do whatever you need to do with the results of your fetch function

Why is this all faster?

So, the idea behind asyncio is that when you send a request to a web server, you don't want to waste time waiting for the response. Instead, an event is registered to tell the event loop when the response arrives. While one response is still pending, you go ahead and make another request (i.e., ask the event loop for the next task), and continue.
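To see that overlap concretely, here's a toy sketch (using asyncio.sleep() to stand in for network latency) where ten one-second "requests" finish in roughly one second total instead of ten:

import asyncio
import time

async def fake_request(i):
    await asyncio.sleep(1)  # stands in for waiting on a server's response
    return i

async def demo():
    start = time.monotonic()
    # All ten waits overlap, so the total wall time is about one second.
    results = await asyncio.gather(*(fake_request(i) for i in range(10)))
    print(len(results), "responses in", round(time.monotonic() - start, 2), "seconds")

asyncio.run(demo())  # prints roughly: 10 responses in 1.0 seconds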

I'm definitely not the best at explaining all this, but I hope this helped you get a basic grasp of how to speed up your webscraping. Good luck!

Edit: Looking back at this, you may have to add an await asyncio.sleep() inside the loop for the tasks to actually start running while you're still looping. This code also uses open().readlines(), which may block the event loop.
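One simple way around the blocking file reads (my own suggestion, not from the guide above) is to load all the links up front with ordinary blocking I/O, before any tasks are scheduled:

def load_links(list_files):
    # Plain blocking file I/O, done once before the event loop runs,
    # so open()/readlines() never stalls any in-flight requests.
    links = []
    for file in list_files:
        with open(file) as f:
            links.extend(line.strip() for line in f if line.strip())
    return links

main() could then just loop over load_links(list_files) when creating its tasks.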

Sending out a bunch of asynchronous requests is the way to go. As @NinjaKitty mentioned, you could look into using aiohttp. I recently had to do something similar and found that it was easier for me to use requests_futures. You can set up a loop to make N asynchronous requests with a callback function for each. Then wait for all N to complete and continue with the next N.
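A minimal sketch of that pattern with requests-futures might look like the following (the chunk size N, the max_workers value, and the process_response() hook are my own placeholder choices, and list_links is assumed to be the flat list of URLs from the question):

from concurrent.futures import as_completed

from requests_futures.sessions import FuturesSession

def process_response(response, *args, **kwargs):
    pass  # placeholder: do whatever you need with each response

session = FuturesSession(max_workers=10)  # size of the underlying thread pool
N = 100  # how many requests to keep in flight at once

for i in range(0, len(list_links), N):
    batch = list_links[i:i + N]
    # Each .get() returns immediately with a Future; the hook runs as each response arrives.
    futures = [session.get(url.strip(), hooks={'response': process_response}) for url in batch]
    # Wait for all N requests in this batch to finish before starting the next N.
    for future in as_completed(futures):
        future.result()  # re-raises here if that request failed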
