
How can I programmatically scrape an image from another website?

Original post: 2010-03-04 14:14:26. Tags: javascript, html, image

A few years ago I helped someone put together a webpage (for local personal use only, not served to the world) that aggregates outdoor webcam photos from several of his favorite websites. It saves time by showing several websites at once. We had it easy when the images on those websites had fixed URLs, and we were able to write some JavaScript code when the URLs changed predictably (e.g., when the URL had a date in it). But now he'd like to add an image whose filename changes seemingly at random, and I don't know how to handle that. Basically, I'd like to:

Programmatically visit another website to find the URL of a particular image.
Insert that URL into my webpage with an <img> tag.

I realize this is probably a confusing and unusual question. I'm willing to clarify as much as possible. I'm just not sure how to ask for what this guy wants to do.

Update: David Dorward mentioned that doing this with JavaScript violates the Same Origin Policy. I'm open to suggestions for other ways to approach this problem.

4 Answers

It's probably a big fat violation of copyright.

The picture is most likely contained within a page, so just visit that page regularly and parse the img tag. Make sure that the random bit you mentioned is not just a random parameter added to force browsers to fetch a fresh image instead of retrieving a cached version.
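
One way to check for such a cache-busting parameter is to compare the image URL seen on two separate visits with the query string removed. A minimal Python sketch, assuming the random part (if it is only a cache-buster) lives in the query string; the function names and example URLs are hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def differs_only_by_query(url_a: str, url_b: str) -> bool:
    """True when both URLs share scheme, host, and path."""
    return strip_query(url_a) == strip_query(url_b)


# Hypothetical URLs seen on two visits to the same page.
first = "http://example.com/cam/latest.jpg?r=83912"
second = "http://example.com/cam/latest.jpg?r=10457"
print(differs_only_by_query(first, second))
```

If this reports that only the query differs, the fixed URL without the parameter may be all the page needs.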

Fetch the HTML of the remote page using Cross Domain AJAX.
Then parse it to get the URLs of the images of interest.
Then, for each URL, add <img src=url />
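
The same three steps (fetch, parse, emit tags) can also be sketched outside the browser, where the Same Origin Policy does not apply. This is not the cross-domain AJAX approach named above but a swapped-in Python script using only the standard library. The class, function names, and sample page are hypothetical, and the demonstration parses a literal string rather than touching the network:

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def fetch_html(page_url: str) -> str:
    """Step 1: download the remote page (needs network access)."""
    with urlopen(page_url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def image_urls(html: str, page_url: str) -> list[str]:
    """Step 2: return absolute URLs of all images in the page."""
    collector = ImgCollector()
    collector.feed(html)
    collector.close()
    return [urljoin(page_url, src) for src in collector.sources]


def img_tags(urls: list[str]) -> str:
    """Step 3: emit one <img> tag per URL."""
    return "\n".join(f'<img src="{escape(url, quote=True)}">' for url in urls)


# Offline demonstration with a hypothetical page; a real run would call
# fetch_html(page_url) instead of using this literal.
page_url = "http://example.com/webcam/"
sample = '<html><body><img src="images/cam_x7Gq2.jpg"><p>unclosed<img src="/logo.png"></body>'
print(img_tags(image_urls(sample, page_url)))
```

In practice you would filter the list down to the one image of interest, for example by a path prefix, before writing the tags into the local page.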

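If you're using PHP in your project, you can use the cURL library to fetch the other website's content and parse it with a regular expression to extract the image URL from the page source.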
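
The regular-expression step might look roughly like the sketch below. It is written in Python rather than PHP, and the pattern, function name, and sample markup are all hypothetical. A regex like this is a pragmatic shortcut and will miss unquoted or unusually formatted attributes:

```python
import re

# Matches src="..." or src='...' inside an <img> tag.
# A pragmatic pattern, not a full HTML parser.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def find_image_urls(html: str) -> list[str]:
    """Return the src value of every quoted <img src=...> in the markup."""
    return IMG_SRC.findall(html)


# Hypothetical page source; the webcam image has a random-looking filename.
sample = (
    '<div><img class="cam" src="/cams/north_8f3a1c.jpg" alt="North">'
    "<IMG SRC='/static/logo.png'></div>"
)
urls = find_image_urls(sample)

# Keep only images under the (assumed) webcam directory.
webcam_urls = [u for u in urls if u.startswith("/cams/")]
print(webcam_urls)
```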

You have a Python question in your profile, so I'll just say that if I were trying to do this, I'd go with Python and Beautiful Soup. It has the added advantage of being able to handle invalid HTML.