
Dynamic web page crawler using Python

I wanted to read an article online, but something came up and I decided I'd rather extract it and read it offline... so here I am, after 4 weeks of trials, and the whole problem boils down to this: the crawler can't seem to read the content of the webpages, even after all this ruckus...

The initial problem was that the article isn't all on one page, so I used the site's own navigation button to move through the content...

I've tried BeautifulSoup, but it can't parse the page very well (presumably because the content is rendered dynamically, so the static HTML it sees doesn't contain the article). I'm using Selenium and ChromeDriver at the moment.
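Here's roughly what my click-through loop looks like; the URL and the CSS selectors below are placeholders, not the real site's:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://example.com/article")  # placeholder URL

pages = []
while True:
    # Grab the rendered article body (selector is a placeholder)
    body = driver.find_element(By.CSS_SELECTOR, "div.article-content")
    pages.append(body.text)
    try:
        # Click the "next page" button (selector is a placeholder)
        driver.find_element(By.CSS_SELECTOR, "a.next-page").click()
    except NoSuchElementException:
        break  # no more pages

driver.quit()
print("\n\n".join(pages))
```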

The reason the crawler can't read the pages seems to be the robots.txt file (the crawl delay it sets for a single page is 3600 seconds, and the article has about 10 pages, which is bearable, but what would happen if it were 100+?), and I don't know how to bypass it or work around it.
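This is how I checked the limits, using Python's built-in robotparser (the URLs are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder URL
rp.read()

# Is the crawler allowed to fetch the article at all?
print(rp.can_fetch("*", "https://example.com/article"))
# Crawl-delay in seconds for this user agent, or None if unset
print(rp.crawl_delay("*"))
```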

Any help??

If robots.txt imposes limitations, then that's the end of it. You should be web-scraping ethically, and that means if the owner of the site wants you to wait 3600 seconds between requests, then so be it.

Even if robots.txt doesn't stipulate wait times, you should still be mindful. Small business and website owners might not know about this, and hammering a website constantly could be costly to them.
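If you do go ahead, the polite version is simply to sleep for the advertised delay between requests. A minimal sketch, assuming placeholder URLs and a fallback delay of my own choosing:

```python
import time
from urllib import robotparser

import requests

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder URL
rp.read()
delay = rp.crawl_delay("*") or 1  # fall back to a modest 1 s if unset

# Placeholder page URLs for a 10-page article
pages = [f"https://example.com/article?page={i}" for i in range(1, 11)]

for url in pages:
    if not rp.can_fetch("*", url):
        continue  # the owner has disallowed this path; skip it
    html = requests.get(url, timeout=30).text
    # ... parse and save html here ...
    time.sleep(delay)  # honour the requested interval between requests
```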
