
Python Scrapy - Scraping data from multiple website URLs

For one of my web projects I need to scrape data from different web sources. To keep it simple, I will explain with an example.

Let's say I want to scrape data about the mobile phones listed on their manufacturers' sites.

http://www.somebrand1.com/mobiles/
...
http://www.somebrand3.com/phones/

I have a huge list of URLs. Every brand's page structures its HTML in its own way.

How can I write a normalized script that traverses the HTML of those listing pages and scrapes the data regardless of the format it is in?

Or do I need to write a separate script for every pattern?

This is called broad crawling and, generally speaking, it is not easy to implement, because websites differ in their nature, their representation, and the loading mechanisms they use.

The general idea would be to have a generic spider and some sort of site-specific configuration that maps item fields to the XPath expressions or CSS selectors used to retrieve the field values from the page. In real life, things are not as simple as they seem: some fields would require post-processing, other fields would need to be extracted after sending a separate request, and so on. In other words, it would be very difficult to keep the spider generic and reliable at the same time.
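A minimal sketch of such a field-to-selector mapping with per-field post-processing might look like the following. The site keys, XPath expressions, and the `extract_item` helper are all hypothetical, invented for illustration; `select` stands in for whatever returns the first match of an XPath (with Scrapy, something like `lambda xp: response.xpath(xp).get()`).

```python
from typing import Callable, Optional

# Hypothetical per-site configuration: field name -> (XPath, post-processor).
# The site keys and XPath expressions are assumptions, not real page layouts.
SITE_CONFIGS = {
    "somebrand1": {
        "name": ("//h2[@class='model']/text()", str.strip),
        "price": (
            "//span[@class='price']/text()",
            lambda s: float(s.replace("$", "").replace(",", "")),
        ),
    },
    "somebrand3": {
        "name": ("//div[@id='title']/text()", str.strip),
        "price": ("//p[@class='cost']/text()", lambda s: float(s.split()[0])),
    },
}


def extract_item(site: str, select: Callable[[str], Optional[str]]) -> dict:
    """Build one item by running every configured XPath through `select`.

    `select` takes an XPath string and returns the first match or None.
    """
    item = {}
    for field, (xpath, clean) in SITE_CONFIGS[site].items():
        raw = select(xpath)
        # Leave the field empty rather than crash when a selector misses.
        item[field] = clean(raw) if raw is not None else None
    return item
```

Keeping the post-processing function next to the selector means a new site only needs a new configuration entry, as long as its fields can be read from a single page. Fields that need a separate request would not fit this simple shape.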

The generic spider should receive a target site as a parameter, read the site-specific configuration, and crawl the site according to it.

