
Speed up HTTP requests in Python and handle 500 errors

I have code that retrieves news results from this newspaper for a given query and time frame (which could be up to a year).

The results are paginated at 10 articles per page, and since I couldn't find a way to increase that, I issue a request for each page and then retrieve each article's title, URL, and date. Each cycle (the HTTP request plus the parsing) takes 30 seconds to a minute, which is extremely slow, and eventually it fails with a 500 response code. I am wondering whether there is a way to speed it up, perhaps by making multiple requests at once. I simply want to retrieve the article details from all the pages. Here is the code:

    import csv
    import re

    import requests
    from bs4 import BeautifulSoup

    URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'


    def run(**params):
        with open("EgyptDaybyDay.csv", "a", newline="") as country_file:
            # Create the writer once instead of once per article.
            writer = csv.writer(country_file, delimiter=',', quotechar='|',
                                quoting=csv.QUOTE_MINIMAL)
            i = 1
            while True:
                params["index"] = str(i)
                response = requests.get(URL.format(**params))
                print(response.status_code)
                page = BeautifulSoup(response.content, "html.parser")
                articles = page.find_all("div", {"class": "newslist"})
                if not articles:
                    break  # an empty page means there are no more results

                for article in articles:
                    url = article.a['href']
                    title = article.img['alt']
                    dateline = article.find("div", {"class": "floatright"})
                    m = re.search(r"([0-9]{2}-[0-9]{2}-[0-9]{4})", dateline.string)
                    date = m.group(1)
                    writer.writerow((date, title, url))
                i += 1


    run(query="Egypt", datefrom="12-01-2010", dateto="12-01-2011")

The most probable slowdown is the server itself, so parallelising the HTTP requests is the best way to make the code run faster, although there is very little you can do to speed up the server's response. There's a good tutorial over at IBM for doing exactly this.
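As a concrete starting point, here is a minimal sketch of that idea using a thread pool from the standard library. The worker count and the fixed page range are illustrative assumptions; with an unknown page count you would keep fetching until a page comes back empty:

    import requests
    from concurrent.futures import ThreadPoolExecutor

    # URL is the search template from the question above.

    def fetch_page(index, **params):
        # Each call runs in its own thread; the threads overlap while
        # they wait on the network, so total wall time drops sharply.
        params['index'] = str(index)
        response = requests.get(URL.format(**params), timeout=30)
        response.raise_for_status()
        return response.content

    with ThreadPoolExecutor(max_workers=5) as pool:
        pages = pool.map(
            lambda i: fetch_page(i, query="Egypt",
                                 datefrom="12-01-2010", dateto="12-01-2011"),
            range(1, 51))  # assumes roughly 50 pages of results
        for html in pages:
            pass  # parse each page with BeautifulSoup as in the question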

This is a good opportunity to try out gevent.

You should put the requests.get call in a separate routine so that your application isn't blocked waiting on I/O.

You can then spawn multiple workers and have queues to pass requests and articles around. Maybe something similar to this:

    import gevent
    import gevent.monkey
    from gevent.queue import Queue
    from gevent import sleep
    gevent.monkey.patch_all()

    MAX_REQUESTS = 10

    requests = Queue(MAX_REQUESTS)
    articles = Queue()

    # Stand-in for real page fetches: each "request" pops one response.
    mock_responses = list(range(100))
    mock_responses.reverse()

    def request():
        print("worker started")
        while True:
            print("request %s" % requests.get())
            sleep(1)  # simulate network latency

            try:
                articles.put('article response %s' % mock_responses.pop())
            except IndexError:
                articles.put(StopIteration)  # signals the consumer to stop
                break

    def run():
        print("run")

        i = 1
        while True:
            requests.put(i)  # blocks while the queue is full
            i += 1

    if __name__ == '__main__':
        for worker in range(MAX_REQUESTS):
            gevent.spawn(request)

        gevent.spawn(run)
        for article in articles:
            print("Got article: %s" % article)

It seems to me that you're looking for a feed, which that newspaper doesn't advertise. However, this problem has been solved before: there are many sites that will generate feeds for an arbitrary website, thus solving at least one of your problems. Some of these require some human guidance, while others have less opportunity for tweaking and are more automatic.

If you can avoid doing the pagination and parsing yourself at all, I'd recommend it. If you can't, I second the use of gevent for simplicity. That said, if the server is sending you back 500s, your code is likely not the issue, and added parallelism may not help.
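On the 500s specifically, a common mitigation is to retry with a backoff delay and keep concurrency modest so the server isn't hammered. A minimal sketch; the retry count and delays are arbitrary choices:

    import time
    import requests

    def get_with_retries(url, retries=3, backoff=2.0):
        # Retry on 5xx responses, sleeping longer after each failure
        # so a struggling server gets room to recover.
        for attempt in range(retries):
            response = requests.get(url, timeout=30)
            if response.status_code < 500:
                return response
            time.sleep(backoff * (attempt + 1))
        response.raise_for_status()  # give up and raise the last 5xx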

You can try making all the calls asynchronously.

Have a look at this: http://pythonquirks.blogspot.in/2011/04/twisted-asynchronous-http-request.html

You could use gevent as well rather than Twisted; I'm just pointing out the options.
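The linked post shows Twisted; on current Python the same idea is more commonly written with asyncio plus the third-party aiohttp package. A sketch under that assumption (URL is the search template from the question, and the page range is illustrative):

    import asyncio
    import aiohttp

    async def fetch(session, url):
        # The coroutine yields while waiting on the network, so all
        # fetches proceed concurrently on a single thread.
        async with session.get(url) as response:
            return await response.text()

    async def main(urls):
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, u) for u in urls))

    urls = [URL.format(index=str(i), query="Egypt",
                       datefrom="12-01-2010", dateto="12-01-2011")
            for i in range(1, 11)]
    pages = asyncio.run(main(urls))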

This might very well come close to what you're looking for.

Ideal method for sending multiple HTTP requests over Python?

Source code: https://github.com/kennethreitz/grequests
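For completeness, a minimal grequests sketch: it builds unsent requests and then sends them concurrently on gevent greenlets. The page range and concurrency limit are assumptions, and URL is the search template from the question:

    import grequests

    reqs = (grequests.get(URL.format(index=str(i), query="Egypt",
                                     datefrom="12-01-2010", dateto="12-01-2011"))
            for i in range(1, 51))
    # size caps how many requests are in flight at once; failed
    # requests come back as None unless an exception handler is given.
    for response in grequests.map(reqs, size=10):
        if response is not None and response.ok:
            pass  # parse response.content with BeautifulSoup as before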
