How to extract valid URLs of a similar pattern?

I am scraping an entire article management system that stores thousands of articles. My script works, but the problem is that BeautifulSoup and requests take a long time determining whether a page is an actual article or an "article not found" page. I have approximately 4,000 articles, and by my calculation the script will take days to complete.

for article_url in edit_article_list:
    article_edit_page = s.get(article_url, data=payload).text
    article_edit_soup = BeautifulSoup(article_edit_page, 'lxml')

    # Section: the <select> only exists on real article edit pages
    submenu = article_edit_soup.find("select", {"name": "ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"})
    if submenu is None:
        continue
    for thing in submenu.findAll("option", {"selected": "selected"}):
        f.write(thing.get_text(strip=True) + "\t")

The first if check determines whether the URL is good or bad. edit_article_list is built by:

for count in range(87418, 307725):
    edit_article_list.append(login_url + "AddEditArticle.aspx?ArticleID=" + str(count))

Right now my script checks for the bad and good URLs and then scrapes the content. Is there any way to get only the valid URLs of this pattern using requests while building the URL list?

To skip articles which don't exist, you need to disable redirects and check the status code:

for article_url in edit_article_list:
    # Don't follow the redirect to the "article not found" page
    r = requests.get(article_url, data=payload, allow_redirects=False)
    if r.status_code != 200:
        continue
    article_edit_page = r.text
    article_edit_soup = BeautifulSoup(article_edit_page, 'lxml')

    # Section: the <select> only exists on real article edit pages
    submenu = article_edit_soup.find("select", {"name": "ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"})
    if submenu is None:
        continue
    for thing in submenu.findAll("option", {"selected": "selected"}):
        f.write(thing.get_text(strip=True) + "\t")
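
Even with the redirect check, looping over roughly 220,000 candidate URLs one at a time will still be slow. A minimal sketch for running the checks concurrently with a thread pool (assuming the session s, payload, and edit_article_list from your code; note that sharing a requests.Session across threads is common practice but not officially guaranteed to be thread-safe):

from concurrent.futures import ThreadPoolExecutor

def check_url(url):
    # A 200 response without a redirect means the edit page really exists
    r = s.get(url, data=payload, allow_redirects=False)
    return url if r.status_code == 200 else None

# 20 workers is a guess; tune it so you don't hammer the server
with ThreadPoolExecutor(max_workers=20) as executor:
    valid_urls = [u for u in executor.map(check_url, edit_article_list) if u]

You can then run your existing scraping loop over valid_urls only.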

I do, though, recommend parsing the article list page for the actual URLs: you are currently firing off over 200,000 requests while only expecting about 4,000 articles, which is a lot of overhead and traffic, and not very efficient!
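
For example (a hypothetical sketch: I don't know your system's markup, so the ArticleList.aspx URL and the href pattern below are assumptions), you could harvest the real ArticleID links straight from the listing page:

# Hypothetical: adjust the listing URL and href pattern to your site
list_page = s.get(login_url + "ArticleList.aspx", data=payload).text
list_soup = BeautifulSoup(list_page, 'lxml')

edit_article_list = []
for a in list_soup.findAll("a", href=True):
    if "AddEditArticle.aspx?ArticleID=" in a["href"]:
        edit_article_list.append(login_url + a["href"])

That way every URL in edit_article_list is already known to exist, and the status-code check becomes a safety net rather than the main filter.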
