
Dynamic web page crawler using Python

I wanted to read an article online, and it occurred to me that I'd like to read it offline once I had successfully extracted it... so here I am, four weeks of trial and error later, and the whole problem comes down to this: the crawler can't seem to read the content of the web pages, even after all of that effort...

The initial problem was that the information isn't all on a single page, so I used the site's own navigation button to page through the content (a rough sketch of that approach is below).

I've tried BeautifulSoup, but it can't seem to parse the page very well. I'm using Selenium and chromedriver at the moment.
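For reference, a minimal sketch of how a Selenium + chromedriver crawler can page through an article by clicking a "next" button and collecting each page's text. The URL and the CSS selectors (`div.article-content`, `a.next`) are placeholders, not taken from the question:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

options = Options()
options.add_argument("--headless=new")      # run without opening a browser window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/article")   # placeholder URL
pages = []

while True:
    # placeholder selector for the article body on each page
    body = driver.find_element(By.CSS_SELECTOR, "div.article-content")
    pages.append(body.text)
    try:
        # placeholder selector for the "next page" button
        driver.find_element(By.CSS_SELECTOR, "a.next").click()
    except NoSuchElementException:
        break  # no "next" button found: last page reached

driver.quit()
print(f"Collected {len(pages)} pages")
```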

The reason the crawler can't read the pages seems to be the robots.txt file (the crawl delay for a single page is 3600 seconds; the article is about 10 pages, which is bearable, but what would happen if it were, say, 100+?), and I don't know how to bypass it or work around it.
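For what it's worth, Python's standard library can read that rule directly: `urllib.robotparser` exposes the Crawl-delay value and the fetch permissions, so the crawler can check them instead of guessing. The domain below is a placeholder:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()

# crawl_delay() returns None when robots.txt sets no Crawl-delay for this user agent
delay = rp.crawl_delay("*")
allowed = rp.can_fetch("*", "https://example.com/article?page=1")
print(f"allowed: {allowed}, crawl delay: {delay} seconds")
```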

Any help?

If robots.txt puts limitations on crawling, then that's the end of it. You should be web-scraping ethically, and that means if the owner of the site wants you to wait 3600 seconds between requests, then so be it.

Even if robots.txt doesn't stipulate wait times, you should still be mindful. Small business / website owners might not know about this, and by hammering a website constantly you could create real costs for them.
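As a minimal sketch of what respecting the delay could look like in practice (assuming the Selenium driver from the earlier sketch is still open), the delay value, page URLs, and the save_page helper are all illustrative, not from the original question:

```python
import time

CRAWL_DELAY = 3600  # seconds, as stated in the site's robots.txt (illustrative value)
page_urls = [f"https://example.com/article?page={n}" for n in range(1, 11)]  # placeholders

def save_page(url, html):
    """Hypothetical helper: write one page's HTML to disk for offline reading."""
    filename = url.rsplit("=", 1)[-1] + ".html"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)

for url in page_urls:
    driver.get(url)                      # reuse the Selenium driver from the sketch above
    save_page(url, driver.page_source)   # keep a local copy of the rendered page
    time.sleep(CRAWL_DELAY)              # wait the full delay before the next request
```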
