
Performance issues while scraping website data with Python

I am trying to scrape data with Python from a website that has around 4000 pages, each containing 25 links.

My problem is that after around 200 processed pages the performance gets so horrendous that even other programs on my computer freeze.

I guess I am not handling memory correctly, or something similar. I would greatly appreciate it if someone could help me get my script to run more smoothly and put less load on my system.

Thanks in advance for any help. :)

EDIT: I found the solution; you can find it in the answer I gave if you scroll down a bit. Thanks to everyone who tried to help me, especially etna and Walter A, whose suggestions got me on the right track. :)

from pprint import pprint
from lxml import etree
import itertools
import requests

def parsePageUrls(page):
    # Collect the href of every link inside a <span class="tip">.
    return page.xpath('//span[@class="tip"]/a/@href')

def isLastPage(page):
    # The last page has no rel="next" link.
    return not page.xpath('//a[@rel="next"]')

urls = []
for i in itertools.count(1):
    content = requests.get('http://www.example.com/index.php?page=' + str(i), allow_redirects=False)
    page = etree.HTML(content.text)

    urls.extend(parsePageUrls(page))

    if isLastPage(page):
        break

pprint(urls)

I finally found the solution. The problem was that I assumed tree.xpath returned a list of plain strings, but it actually returned a list of _ElementUnicodeResult objects. Each of these holds a reference to its parent element, which kept the GC from freeing the memory of the parsed pages.

So the solution is to convert these _ElementUnicodeResult objects into normal strings to get rid of the references.
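To illustrate the mechanism without depending on lxml, here is a minimal toy model. Node and ParentedStr are hypothetical stand-ins I made up (they are not lxml classes), and the immediate cleanup relies on CPython's reference counting, so treat this as a sketch of the idea rather than a description of lxml's internals:

```python
import gc
import weakref

class Node:
    """Stand-in for a parsed page tree (hypothetical, not an lxml class)."""
    def __init__(self, name):
        self.name = name

class ParentedStr(str):
    """Stand-in for a string result that remembers the node it came from."""
    def __new__(cls, value, parent):
        obj = super().__new__(cls, value)
        obj.parent = parent  # this back-reference keeps the tree alive
        return obj

def tree_survives(convert):
    """Report whether the tree is still alive after we keep only the result."""
    tree = Node("page")
    alive = weakref.ref(tree)
    kept = [convert(ParentedStr("/item/1", tree))]
    del tree      # drop our own reference to the tree
    gc.collect()  # not strictly needed on CPython, but makes the intent clear
    return alive() is not None

print(tree_survives(lambda s: s))  # result kept as-is
print(tree_survives(str))          # result converted to a plain str
```

Keeping the result as-is leaves the tree reachable through the parent attribute, while str() produces a plain string with no such attribute, so nothing holds on to the tree any more.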

Here is the source that helped me understand the issue: http://lxml.de/api/lxml.etree._ElementTree-class.html#xpath

As for the provided code, the following fixed it:

Instead of:

urls.extend(parsePageUrls(page))

It had to be:

for url in parsePageUrls(page):
    urls.append(str(url))
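The same conversion can also live inside the parsing function, so callers never see the lxml result objects. This is only a sketch: parse_page_urls is a renamed variant of the function from the question, and the sample markup is made up so the snippet runs without a network request:

```python
from lxml import etree

def parse_page_urls(page):
    """Return the link targets as plain str objects, detached from the tree."""
    return [str(href) for href in page.xpath('//span[@class="tip"]/a/@href')]

# Made-up sample markup standing in for one downloaded page.
sample = (
    '<html><body>'
    '<span class="tip"><a href="/item/1">one</a></span>'
    '<span class="tip"><a href="/item/2">two</a></span>'
    '</body></html>'
)

page = etree.HTML(sample)
urls = parse_page_urls(page)
print(urls)
print(all(type(u) is str for u in urls))
```

With this variant the original urls.extend(...) call can stay unchanged, because the list it receives already contains only ordinary strings.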
