
How to Scrape websites, client side or server side?

I am creating a bookmarklet button that, when the user clicks it in their browser, scrapes the current page and extracts some values from it, such as the price, item name, and item image.

These fields will vary, meaning the logic for extracting the values will be different for each domain ("amazon", "ebay", and so on).

My questions are:

  • Should I use JavaScript to scrape this data and then send it to the server?
  • Or should I just send the URL to my server side and then use .NET code to scrape the values?
  • Which way is best, and why? What are the advantages and disadvantages of each?

Watch this video and you will see exactly what I want to do: http://www.vimeo.com/1626505

If you want to pull information from another site for use in your own site (written in ASP.NET, for example), then you'll typically do this on the server side so that you have a rich language for processing the results (e.g. C#). You'd do this via a WebRequest object in .NET.
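A minimal sketch of that WebRequest approach, assuming a plain C# console program; the product URL is only a placeholder, and the real parsing of price, name and image would replace the length printout:

    using System;
    using System.IO;
    using System.Net;

    class PageFetcher
    {
        // Download the raw HTML of a page so it can be parsed on the server.
        static string FetchHtml(string url)
        {
            WebRequest request = WebRequest.Create(url);
            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }

        static void Main()
        {
            // Placeholder URL - substitute the page your bookmarklet reports.
            string html = FetchHtml("https://example.com/item/123");
            Console.WriteLine(html.Length);
        }
    }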

The primary use of client-side processing is to use JavaScript to pull information to display on your site. An example would be the scripts provided by the Weather Channel to show a little weather box on your site, or very simple actions such as adding a page to favorites.

UPDATE: Amr writes that he is attempting to recreate the functionality of some popular screen-scraping software, which would require some quite sophisticated processing. Amr, I'd consider creating an application that uses the IE browser object to display web pages - it is quite simple. You could then just pull the InnerHTML (I think; it has been a few years since I implemented an IE-object-based program) to retrieve the contents of the page and do your magic. You could, of course, use a WebRequest object (just handing it the URL used in the browser object), but that wouldn't be very efficient, as it would download the page a second time.
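For reference, a rough sketch of that idea using the Windows Forms WebBrowser control (which hosts the IE browser object); this is only an illustration, the URL is a placeholder, and a real program would parse the retrieved HTML instead of printing its length:

    using System;
    using System.Windows.Forms;

    class BrowserScraper
    {
        [STAThread] // the WebBrowser control requires a single-threaded apartment
        static void Main()
        {
            var browser = new WebBrowser { ScriptErrorsSuppressed = true };

            browser.DocumentCompleted += (sender, e) =>
            {
                // Once the page has loaded, pull the rendered HTML out of the control.
                string html = browser.Document.Body.InnerHtml;
                Console.WriteLine(html.Length);
                Application.ExitThread();
            };

            browser.Navigate("https://example.com/item/123"); // placeholder URL
            Application.Run(); // pump messages so the control can finish loading
        }
    }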

Is this what you are after?

If you want to use only JavaScript to do this, you are liable to end up with a fairly large bookmarklet unless you know the exact layout of every site it will be used on (and even then it will be big).

A common way I have seen this done is to use a web service on your own server that your bookmarklet (which uses JavaScript) redirects to, along with some parameters such as the URL of the page you are viewing. Your server then scrapes the page and does the work of parsing the HTML for the things you are interested in.

A good example is the "Import to Mendeley" bookmarklet, which passes the URL of the page you are visiting to its server, where it then extracts information about the scientific papers listed on the page and imports them into your collection.
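A minimal sketch of what such a server-side endpoint could look like in classic ASP.NET; the handler name, the "url" query-string parameter and the /Scrape.ashx path are assumptions made up for illustration, and the per-domain parsing is left as a comment:

    using System.Net;
    using System.Web;

    // Hypothetical handler, e.g. registered at /Scrape.ashx.
    // The bookmarklet would call it as /Scrape.ashx?url=<encoded page URL>.
    public class ScrapeHandler : IHttpHandler
    {
        public void ProcessRequest(HttpContext context)
        {
            string url = context.Request.QueryString["url"];

            using (var client = new WebClient())
            {
                // Fetch the page on the server side...
                string html = client.DownloadString(url);

                // ...then run domain-specific parsing (Amazon, eBay, ...) here
                // to pull out the price, item name and item image.
                context.Response.ContentType = "text/plain";
                context.Response.Write("Fetched " + html.Length + " characters from " + url);
            }
        }

        public bool IsReusable
        {
            get { return false; }
        }
    }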

I would scrape it on the server side, because (I'm a Java guy) I like static languages more than dynamic scripting languages, so maintaining the logic on the backend would be more comfortable for me. On the other hand, it depends on how many items you want to scrape and how complex the logic would be. Perhaps the values are parseable with a single id selector in JavaScript, in which case server-side processing could be overkill.

Bookmarklets are client-side by definition, but you could have the client depend on a server; your example doesn't provide enough information, though. What do you want to do with the scraped info?

If you include the scraping code in the bookmarklet, your users will have to update their bookmark whenever you ship new functionality or bug fixes. Do it server-side and all your users get the new stuff instantly :)
