
Efficient way to scrape images from website in Django/Python

First I guess I should say I am still a bit of a Django/Python noob. I am in the midst of a project that allows users to enter a URL; the site scrapes the content from that page and returns any images over a certain size, plus the page's title tag, so the user can then pick which image they want to use on their profile. A pretty standard scenario, I assume. I have this working by using Selenium (headless Chrome) to grab the destination page's content, some Python to determine each image's file size, and a Django view that renders it all into a template. The image the user selects is then downloaded and stored locally.
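For context, the page grab is roughly this shape (a simplified sketch; the real code has more error handling):

    from selenium import webdriver

    def fetch_page(url):
        """Render the page in headless Chrome and return its HTML."""
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()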

However, I seriously doubt the scalability of this. It's currently just running locally, and I am very concerned about how it would cope if there were lots of users all running it at the same time. I am firing up that headless Chrome browser every time a request is made, which doesn't sound efficient, and I am having to download each image to determine its size so I can decide whether it's large enough. One example took 12 seconds from me submitting the URL to displaying the results to the user, whereas the same destination URL put through www.kit.com (they have very similar web-scraping functionality) took 3 seconds.
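For what it's worth, the dimensions can in principle be read from just the first chunk of each file rather than a full download; a sketch using requests and Pillow (the function name is my own, not from the project):

    import requests
    from PIL import ImageFile

    def probe_dimensions(image_url, chunk_size=1024):
        """Feed the image to Pillow chunk by chunk and stop as soon as the
        header has been parsed, avoiding a full download."""
        parser = ImageFile.Parser()
        with requests.get(image_url, stream=True, timeout=10) as resp:
            resp.raise_for_status()
            for chunk in resp.iter_content(chunk_size):
                parser.feed(chunk)
                if parser.image:
                    return parser.image.size  # (width, height)
        return None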

I have not provided any code because the code I have does what it should; I think the approach, however, is incorrect. To summarise, what I want is:

  • To allow a user to enter a URL and have it return all images (or just the URLs of those images) from that page over a certain size (width/height), along with the page title.

  • For this to be the most efficient solution, taking into account that it would be run concurrently by many users at once.

  • For it to work in a Django (2.0) / Python (3+) environment.

I am not completely against using a 3rd-party service's API if one exists, but it would be my least preferred option.

Any help/pointers would be much appreciated.

You can use two Python solutions in your case:
1) BeautifulSoup, and there are good existing answers on how to download images using it. You just have to make it a separate function and pass the site as an argument. It is also very easy to parse only the image links, as you said, depending on the speed you need (obviously scraping the files themselves, especially when there are a lot of them, will be much slower than scraping just the links). This tool is just for parsing and scraping page content.
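A minimal sketch of that separate function, using requests to fetch the page (the function and variable names here are just illustrative):

    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    def scrape_page(url):
        """Return the page title and the absolute URLs of all <img> tags."""
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        title = soup.title.string.strip() if soup.title and soup.title.string else ""
        image_urls = [urljoin(url, img["src"]) for img in soup.find_all("img", src=True)]
        return title, image_urls

Note that unlike Selenium this does not execute JavaScript, so images injected client-side will be missed; that trade-off is where most of the speed difference comes from.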

2) Scrapy - this is a much more powerful tool: a framework through which you can connect your spider to Django models and handle images much more efficiently, using its built-in image pipeline. It is much more flexible, with a lot of features for working with scraped data. I am not sure whether you need it in your project, or whether it would be overkill in your case.
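As a rough illustration (the spider name and field names below are placeholders), a minimal spider that yields the page title and image URLs, leaving the downloading and size filtering to Scrapy's built-in ImagesPipeline:

    import scrapy

    class PageImageSpider(scrapy.Spider):
        name = "page_images"

        def __init__(self, start_url=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.start_urls = [start_url] if start_url else []

        def parse(self, response):
            # "image_urls" is the field the built-in ImagesPipeline reads by default.
            yield {
                "title": response.css("title::text").get(default="").strip(),
                "image_urls": [response.urljoin(src)
                               for src in response.css("img::attr(src)").getall()],
            }

In settings.py you would enable the pipeline and let it do the size filtering for you (these are standard Scrapy settings; the values are examples):

    ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
    IMAGES_STORE = "/path/to/images"  # adjust to your storage
    IMAGES_MIN_WIDTH = 200            # example threshold
    IMAGES_MIN_HEIGHT = 200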

Also, my advice is to run the spider in some background task queue such as Celery, and fetch the result via AJAX, because it may take some time to parse the content - so don't make the user wait for the response.
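A sketch of how that might look with Celery plus a polling endpoint (scrape_page is the hypothetical helper from above; the task and view names are likewise placeholders):

    from celery import shared_task
    from celery.result import AsyncResult
    from django.http import JsonResponse

    @shared_task
    def scrape_task(url):
        title, image_urls = scrape_page(url)  # hypothetical helper from above
        return {"title": title, "images": image_urls}

    def scrape_status(request, task_id):
        """AJAX endpoint the page polls until the task has finished."""
        result = AsyncResult(task_id)
        if result.ready():
            return JsonResponse({"done": True, "data": result.result})
        return JsonResponse({"done": False})

The view that receives the submitted URL would call scrape_task.delay(url) and hand the returned task id back to the page for polling.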

PS: You can even combine those two tools in some cases :)
