For one of my web projects I need to scrape data from different web sources. To keep it simple, I'll explain with an example.
Let's say I want to scrape data about mobiles listed on their manufacturers' sites:
http://www.somebrand1.com/mobiles/ . . http://www.somebrand3.com/phones/
I have a huge list of URLs. Every brand's page has its own way of presenting the HTML to the browser.
How can I write a normalized script that traverses the HTML of those listing pages and scrapes the data regardless of the format it is in?
Or do I need to write a separate scraping script for every pattern?
This is called broad crawling and, generally speaking, it is not easy to implement because of the different structures, representations, and loading mechanisms web sites use.
The general idea would be to have a generic spider and some sort of site-specific configuration that maps item fields to the XPath expressions or CSS selectors used to retrieve the field values from the page. In real life, things are not as simple as they seem: some fields would require post-processing, other fields would need to be extracted after sending a separate request, and so on. In other words, it is very difficult to stay generic and reliable at the same time.
The generic spider should receive the target site as a parameter, read the site-specific configuration, and crawl the site according to it.
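A minimal sketch of that idea in Python, using only the standard library so it runs without a network connection: the site names, field names, and selector expressions are all hypothetical stand-ins for a real per-site configuration, and `xml.etree.ElementTree` (which only accepts well-formed markup) stands in for a tolerant HTML parser such as lxml or the ones Scrapy uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site configuration: one entry per brand site, mapping
# item fields to the (limited) XPath expressions ElementTree understands.
SITE_CONFIGS = {
    "somebrand1.com": {
        "listing": ".//div[@class='mobile']",   # one node per listed phone
        "fields": {
            "name": ".//h2",
            "price": ".//span[@class='price']",
        },
    },
    # "somebrand3.com": {...}  # same field names, different selectors
}

def scrape_listing(html, site):
    """Generic extractor: find every listing node, then pull each field
    out of it according to the site-specific configuration."""
    config = SITE_CONFIGS[site]
    root = ET.fromstring(html)
    items = []
    for node in root.findall(config["listing"]):
        item = {}
        for field, xpath in config["fields"].items():
            el = node.find(xpath)
            item[field] = el.text.strip() if el is not None and el.text else None
        items.append(item)
    return items

# Demo on a tiny well-formed sample page; a real spider would download the
# page first and route it to the right config by URL.
sample = """
<html><body>
  <div class="mobile"><h2>Phone A</h2><span class="price">199</span></div>
  <div class="mobile"><h2>Phone B</h2><span class="price">299</span></div>
</body></html>
"""
print(scrape_listing(sample, "somebrand1.com"))
# → [{'name': 'Phone A', 'price': '199'}, {'name': 'Phone B', 'price': '299'}]
```

Adding a new brand then means adding a config entry rather than a new script; the post-processing and extra-request cases mentioned above are exactly what this simple field-to-selector mapping cannot express.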