I am scraping an entire article management system that stores thousands of articles. My script works, but the problem is that beautifulsoup and requests both take a long time determining whether a page is an actual article or an "article not found" page. I have approximately 4000 articles, and by my calculation the script will take days to complete.
for article_url in edit_article_list:
    article_edit_page = s.get(article_url, data=payload).text
    article_edit_soup = BeautifulSoup(article_edit_page, 'lxml')
    # Section
    if article_edit_soup.find("select", {"name": "ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"}) is None:
        continue
    else:
        for thing in article_edit_soup.find("select", {"name": "ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"}).findAll("option", {"selected": "selected"}):
            f.write(thing.get_text(strip=True) + "\t")
The first if determines whether the URL is good or bad. edit_article_list is built by:
for count in range(87418, 307725):
    edit_article_list.append(login_url + "AddEditArticle.aspx?ArticleID=" + str(count))
Right now my script checks for both the bad and good URLs and then scrapes the content. Is there any way to get only the valid URLs of this pattern using requests while building the URL list?
To skip articles which don't exist, you need to disallow redirects and check the status code:
for article_url in edit_article_list:
    r = requests.get(article_url, data=payload, allow_redirects=False)
    if r.status_code != 200:
        continue
    article_edit_page = r.text
    article_edit_soup = BeautifulSoup(article_edit_page, 'lxml')
    # Section
    if article_edit_soup.find("select", {"name": "ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"}) is None:
        continue
    else:
        for thing in article_edit_soup.find("select", {"name": "ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"}).findAll("option", {"selected": "selected"}):
            f.write(thing.get_text(strip=True) + "\t")
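If you want to separate "find the valid IDs" from "scrape the content", a lighter-weight variant is to probe each candidate URL with a HEAD request first, which fetches only the response headers. This is a sketch, not your exact setup: the base URL is a hypothetical stand-in for your login_url, and it assumes the server answers HEAD the same way it answers GET (some ASP.NET apps do not, so verify on a few known IDs first).

```python
import requests

# Hypothetical base URL standing in for login_url from the question.
BASE = "https://example.com/AddEditArticle.aspx?ArticleID="

def find_valid_ids(id_range, session=None):
    """Return the article IDs whose edit page answers 200 without redirecting."""
    s = session or requests.Session()
    valid = []
    for article_id in id_range:
        # HEAD returns headers only, so a miss costs far less than a full GET.
        r = s.head(BASE + str(article_id), allow_redirects=False)
        if r.status_code == 200:
            valid.append(article_id)
    return valid
```

Using a single Session here also reuses the TCP connection across requests, which matters when you are probing a couple of hundred thousand IDs.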
I do recommend, though, parsing the article list page for the actual URLs: you are currently firing off over 200,000 requests while expecting only 4,000 articles. That is a lot of overhead and traffic, and not very efficient!
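As a sketch of that approach, assuming the admin area has a listing page whose rows link to AddEditArticle.aspx (the HTML below is invented for illustration, and the real list page's markup will differ), you can collect exactly the URLs that exist instead of guessing IDs:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical listing-page HTML; substitute the real page fetched with your session.
listing_html = """
<table id="articleList">
  <tr><td><a href="AddEditArticle.aspx?ArticleID=87423">First article</a></td></tr>
  <tr><td><a href="AddEditArticle.aspx?ArticleID=90001">Second article</a></td></tr>
</table>
"""

base_url = "https://example.com/admin/"  # hypothetical, stands in for login_url

soup = BeautifulSoup(listing_html, "html.parser")
# Select only anchors that point at the edit page, then resolve them to absolute URLs.
edit_article_list = [
    urljoin(base_url, a["href"])
    for a in soup.select('a[href*="AddEditArticle.aspx?ArticleID="]')
]
```

With this, edit_article_list contains only real articles, so the loop over it never hits a "not found" page at all.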