
Dynamic web page crawler using Python

I wanted to read an article online, but something came up and I decided I'd rather extract it and read it offline... so here I am, after 4 weeks of trials, and the whole problem boils down to this: the crawler can't seem to read the content of the webpages, even after all this ruckus...

The initial problem was that the article isn't all on one page, so I used the site's own navigation button to move through the content...

I've tried BeautifulSoup, but it can't parse the page very well (presumably because the content is rendered dynamically, so the static HTML it sees doesn't contain the article). I'm using Selenium and ChromeDriver at the moment.
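Here's roughly what my click-through loop looks like; the URL and the CSS selectors below are placeholders, not the real site's:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://example.com/article")  # placeholder URL

pages = []
while True:
    # Grab the rendered article body (selector is a placeholder)
    body = driver.find_element(By.CSS_SELECTOR, "div.article-content")
    pages.append(body.text)
    try:
        # Click the "next page" button (selector is a placeholder)
        driver.find_element(By.CSS_SELECTOR, "a.next-page").click()
    except NoSuchElementException:
        break  # no more pages

driver.quit()
print("\n\n".join(pages))
```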

The reason the crawler can't read the pages seems to be the robots.txt file (the crawl delay it sets for a single page is 3600 seconds, and the article has about 10 pages, which is bearable, but what would happen if it were 100+?), and I don't know how to bypass it or work around it.
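This is how I checked the limits, using Python's built-in robotparser (the URLs are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder URL
rp.read()

# Is the crawler allowed to fetch the article at all?
print(rp.can_fetch("*", "https://example.com/article"))
# Crawl-delay in seconds for this user agent, or None if unset
print(rp.crawl_delay("*"))
```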

Any help??

If robots.txt imposes limitations, then that's the end of it. You should be web-scraping ethically, and that means if the owner of the site wants you to wait 3600 seconds between requests, then so be it.

Even if robots.txt doesn't stipulate wait times, you should still be mindful. Small business and website owners might not know about this, and hammering a website constantly could be costly to them.
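If you do go ahead, the polite version is simply to sleep for the advertised delay between requests. A minimal sketch, assuming placeholder URLs and a fallback delay of my own choosing:

```python
import time
from urllib import robotparser

import requests

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder URL
rp.read()
delay = rp.crawl_delay("*") or 1  # fall back to a modest 1 s if unset

# Placeholder page URLs for a 10-page article
pages = [f"https://example.com/article?page={i}" for i in range(1, 11)]

for url in pages:
    if not rp.can_fetch("*", url):
        continue  # the owner has disallowed this path; skip it
    html = requests.get(url, timeout=30).text
    # ... parse and save html here ...
    time.sleep(delay)  # honour the requested interval between requests
```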
