
Efficient way to scrape images from website in Django/Python

First, I guess I should say I am still a bit of a Django/Python noob. I am in the midst of a project that allows users to enter a URL; the site scrapes the content from that page and returns any images over a certain size, plus the page title tag, so the user can then pick which image they want to use on their profile. A pretty standard scenario, I assume. I have this working by using Selenium (headless Chrome browser) to grab the destination page content, some Python to determine the file sizes, and then my Django view spits it all out into a template. I then have it coded in such a way that the image the user selects is downloaded and stored locally.

However, I seriously doubt the scalability of this. It's currently just running locally, and I am very concerned about how it would cope if there were lots of users all making requests at the same time. I am firing up that headless Chrome browser every time a request is made, which doesn't sound efficient, and I am having to download each image just to determine its size so I can decide whether it's large enough. In one example it took 12 seconds to get from me submitting the URL to displaying the results to the user, whereas the same destination URL put through www.kit.com (they have very similar web-scraping functionality) took 3 seconds.

I have not provided any code, as the code I have does what it should; I think the approach, however, is incorrect. To summarise, what I want is:

  • To allow a user to enter a URL and for it to return all images (or just the URLs to those images) from that page over a certain size (width/height), and the page title.

  • For this to be the most efficient solution, taking into account that it would be run concurrently by many users at once.

  • For it to work in a Django (2.0) / Python (3+) environment.

I am not completely against using the API from a 3rd party service if one exists, but it would be my least preferred option.

Any help/pointers would be much appreciated.

You can use two Python solutions in your case:
1) BeautifulSoup, and there is a good answer out there on how to download the images using it. You just have to make it a separate function and pass the site as an argument to it. It is also very easy to parse only the image links, as you said - it depends on the speed you need (obviously scraping the files themselves, especially when there is a large number of them, will be much slower than just collecting the links). This tool is just for parsing and scraping the content of the page.
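A minimal sketch of that approach, assuming the requests, beautifulsoup4 and Pillow packages; the size threshold and the scrape_images name are just illustrative:

```python
from io import BytesIO
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from PIL import Image

MIN_WIDTH, MIN_HEIGHT = 300, 300  # illustrative size threshold

def scrape_images(page_url):
    """Return the page title and the URLs of images over the size threshold."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""

    large_images = []
    for img in soup.find_all("img", src=True):
        img_url = urljoin(page_url, img["src"])
        # Fetch the image bytes so Pillow can read its dimensions.
        resp = requests.get(img_url, timeout=10)
        try:
            width, height = Image.open(BytesIO(resp.content)).size
        except OSError:  # not a readable raster image (e.g. SVG, broken file)
            continue
        if width >= MIN_WIDTH and height >= MIN_HEIGHT:
            large_images.append(img_url)
    return title, large_images
```

If downloading every image just to measure it turns out to be the bottleneck, you can check the width/height attributes on the img tags first (when they are present) and skip anything that is obviously too small.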

2) Scrapy - this is a much more powerful tool, a full framework: with it you can connect your spiders to your Django models and work with images much more efficiently using its built-in image pipelines. It is much more flexible, with a lot of features for handling the scraped data. I am not sure whether you need it in your project or whether it is overpowered for your case.
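A minimal Scrapy sketch along the same lines, using the built-in ImagesPipeline (which itself needs Pillow); the spider name, settings values and size thresholds here are illustrative:

```python
import scrapy

class PageImagesSpider(scrapy.Spider):
    name = "page_images"

    # Illustrative settings: enable the built-in image pipeline and let it
    # drop anything below the minimum width/height.
    custom_settings = {
        "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
        "IMAGES_STORE": "scraped_images",
        "IMAGES_MIN_WIDTH": 300,
        "IMAGES_MIN_HEIGHT": 300,
    }

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []

    def parse(self, response):
        yield {
            "title": response.css("title::text").get(default="").strip(),
            # The ImagesPipeline downloads every URL in image_urls and records
            # the stored files under the item's images key.
            "image_urls": [
                response.urljoin(src)
                for src in response.css("img::attr(src)").getall()
            ],
        }
```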

Also, my advice is to run the spider in some background task queue like Celery, and fetch the result via AJAX, because it may take some time to parse the content - don't make the user wait for the response.
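A rough sketch of that setup with Celery, assuming a Celery app and result backend are already configured for the Django project; the task and view names are illustrative, and scrape_images is the helper from the BeautifulSoup sketch above:

```python
# tasks.py -- assumes Celery is already wired up for the Django project
from celery import shared_task

from .scraping import scrape_images  # the helper sketched above (illustrative path)

@shared_task
def scrape_page(page_url):
    title, image_urls = scrape_images(page_url)
    return {"title": title, "image_urls": image_urls}


# views.py -- kick off the task, then let the page poll for the result via AJAX
from celery.result import AsyncResult
from django.http import JsonResponse

from .tasks import scrape_page

def start_scrape(request):
    task = scrape_page.delay(request.GET["url"])
    return JsonResponse({"task_id": task.id})

def scrape_status(request, task_id):
    result = AsyncResult(task_id)
    payload = {"state": result.state}
    if result.ready():
        payload["result"] = result.result
    return JsonResponse(payload)
```

The front end first hits start_scrape, then polls scrape_status with the returned task_id until the task has finished, so the user's page stays responsive while the scrape runs.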

PS: You can even combine those two tools in some cases :)
