
crawling multiple webpages from a website

I want to extract data from a website. Say the URL is http://www.example.com/. I put this URL in start_urls (following the DMOZ example in the documentation). But I also want to create a GUI: when I enter a string and click a button, it should append that string to start_urls and extract all the pages that can be accessed that way, such as http://www.example.com/computer/page-1. Can you please tell me how I can do this with a loop? I have tried putting more URLs into start_urls manually to check whether it works, but it doesn't respond well; sometimes it gets no response at all. Any thoughts on that?
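For reference, since start_urls and the DMOZ example suggest Scrapy, here is a minimal sketch of generating start_urls with a loop inside a spider. The category string ("computer") and the page count are placeholders standing in for whatever the GUI would supply, not values from the original question.

```python
# A minimal sketch, assuming Scrapy; "category" and "pages" are hypothetical
# spider arguments that a GUI (or the command line) could pass in.
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def __init__(self, category="computer", pages=10, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Build the start URLs with a loop instead of listing them by hand.
        self.start_urls = [
            f"http://www.example.com/{category}/page-{i}"
            for i in range(1, int(pages) + 1)
        ]

    def parse(self, response):
        # Placeholder extraction logic; real selectors depend on the site.
        yield {"url": response.url, "title": response.css("title::text").get()}
```

It could then be run with spider arguments, e.g. `scrapy crawl example -a category=computer -a pages=5`, and a GUI would only need to supply those two strings.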

How would you do this using a loop?

Friend, that would be some loop. Seriously, I would consider looking into existing open-source scripts and applications that do this. You could easily see how it is done and get an idea of how to approach it, and then of course improve on it however you like. I am quite certain there are many examples of web-spidering solutions available out there. With my limited toolset, I would probably try hacking something together with wget controlled by a bash or Perl script of some sort, but that is me, and it is not necessarily everyone's preference.

As for the task itself, if you really want to code it up yourself, consider splitting it into subtasks. Some would see this as two applications: for example, one application could store the links and the other could be the 'fetcher', the spider. A rough sketch of that split follows.
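This is only an illustration of the two-part idea, not a real crawler; the class names and the queue-backed storage are assumptions.

```python
# A rough sketch of the split described above: a link store and a separate fetcher.
from collections import deque
from urllib.request import urlopen


class LinkStore:
    """Holds the URLs still to be visited."""

    def __init__(self, seed_urls):
        self.pending = deque(seed_urls)

    def next_url(self):
        # Hand out the next pending link, or None when the store is empty.
        return self.pending.popleft() if self.pending else None


class Fetcher:
    """Downloads the pages handed to it by the link store."""

    def fetch(self, url):
        with urlopen(url) as resp:
            return resp.read()


store = LinkStore(["http://www.example.com/computer/page-1"])
fetcher = Fetcher()
url = store.next_url()
if url:
    html = fetcher.fetch(url)
```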

And try not to think in terms of 'loops'. There is no loop yet at this stage of your project.

If you are on Linux, or have Cygwin / GNU tools installed on Windows, then as I was hinting, I strongly suspect wget can be scripted to do this: go through a list of text links and fetch the CSS, images, and maybe even the JS. A sketch of driving wget from a script is shown below.
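As a hedged sketch of that idea, the snippet below calls wget from Python; it assumes wget is installed and that a file named urls.txt lists one link per line (both the filename and the output directory are placeholders).

```python
# Drive wget over a list of links, fetching each page plus its requisites.
import subprocess

subprocess.run(
    [
        "wget",
        "--input-file=urls.txt",    # read the list of links from a file
        "--page-requisites",        # also fetch CSS, images, and linked scripts
        "--convert-links",          # rewrite links so pages work locally
        "--directory-prefix=pages", # save everything under ./pages
    ],
    check=True,
)
```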

Of course, once all this is working fine from the command line, you may then want a front end to access it in a friendly manner. Depending on the language / technology stack you use, you will have different options; that is another topic I won't get into.

Hope this helps, cheers!

In a nutshell, you can search for existing open-source web-spidering resources on SourceForge, GitHub, Google, etc.

Depending on your needs, Netwoof may be able to do it for you. It can loop over links, multiple results pages, etc. It is fully automated, generates an API, and can even turn unstructured data into structured data.
