
crawling multiple webpages from a website

I want to extract data from a website. Say the URL is http://www.example.com/. I put this URL in start_urls (following the DMOZ example in the documentation). But I also want to create a GUI: when I enter a string and click a button, it should append that string to start_urls and extract all the pages that can be accessed that way, such as http://www.example.com/computer/page-1. Can you please tell me how I can do this with a loop? I have tried putting more URLs into start_urls manually to check whether it works, but it doesn't respond well; sometimes it gets no response at all. Any thoughts on that?
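For reference, since start_urls and the DMOZ example suggest Scrapy, here is a minimal sketch of generating start_urls with a loop inside a spider. The category string ("computer") and the page count are placeholders standing in for whatever the GUI would supply, not values from the original question.

```python
# A minimal sketch, assuming Scrapy; "category" and "pages" are hypothetical
# spider arguments that a GUI (or the command line) could pass in.
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def __init__(self, category="computer", pages=10, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Build the start URLs with a loop instead of listing them by hand.
        self.start_urls = [
            f"http://www.example.com/{category}/page-{i}"
            for i in range(1, int(pages) + 1)
        ]

    def parse(self, response):
        # Placeholder extraction logic; real selectors depend on the site.
        yield {"url": response.url, "title": response.css("title::text").get()}
```

It could then be run with spider arguments, e.g. `scrapy crawl example -a category=computer -a pages=5`, and a GUI would only need to supply those two strings.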

How would you do this using a loop?

Friend, that would be some loop. Seriously, I would consider looking into existing open-source scripts and applications that do this. You could easily see how it is done and get an idea of how to approach it, and then of course improve on it however you like. I am quite certain there are many examples of web-spidering solutions available out there. With my limited toolset, I would probably try hacking something together with wget controlled by a bash or Perl script of some sort, but that is me, and it is not necessarily everyone's preference.

As for the task itself, if you really want to code it up yourself, consider splitting it into subtasks. Some would see this as two applications: for example, one application could store the links and the other could be the 'fetcher', the spider. A rough sketch of that split follows.
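This is only an illustration of the two-part idea, not a real crawler; the class names and the queue-backed storage are assumptions.

```python
# A rough sketch of the split described above: a link store and a separate fetcher.
from collections import deque
from urllib.request import urlopen


class LinkStore:
    """Holds the URLs still to be visited."""

    def __init__(self, seed_urls):
        self.pending = deque(seed_urls)

    def next_url(self):
        # Hand out the next pending link, or None when the store is empty.
        return self.pending.popleft() if self.pending else None


class Fetcher:
    """Downloads the pages handed to it by the link store."""

    def fetch(self, url):
        with urlopen(url) as resp:
            return resp.read()


store = LinkStore(["http://www.example.com/computer/page-1"])
fetcher = Fetcher()
url = store.next_url()
if url:
    html = fetcher.fetch(url)
```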

And try not to think in terms of 'loops'. There is no loop yet at this stage of your project.

If you are on Linux, or have Cygwin / GNU tools installed on Windows, then as I was hinting, I strongly suspect wget can be scripted to do this: go through a list of text links and fetch the CSS, images, and maybe even the JS. A sketch of driving wget from a script is shown below.
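As a hedged sketch of that idea, the snippet below calls wget from Python; it assumes wget is installed and that a file named urls.txt lists one link per line (both the filename and the output directory are placeholders).

```python
# Drive wget over a list of links, fetching each page plus its requisites.
import subprocess

subprocess.run(
    [
        "wget",
        "--input-file=urls.txt",    # read the list of links from a file
        "--page-requisites",        # also fetch CSS, images, and linked scripts
        "--convert-links",          # rewrite links so pages work locally
        "--directory-prefix=pages", # save everything under ./pages
    ],
    check=True,
)
```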

Of course, once all this is working fine from the command line, you may then want a front end to access it in a friendly manner. Depending on the language / technology stack you use, you will have different options; that is another topic I won't get into.

Hope this helps, cheers!

In a nutshell, you can search for existing open-source web-spidering resources on SourceForge, GitHub, Google, etc.

Depending on your needs, Netwoof may be able to do it for you. It can loop over links, multiple results pages, etc. It is fully automated, generates an API, and can even turn unstructured data into structured data.
