How to extract valid URLs of a similar pattern?

I am scraping an entire article management system that stores thousands of articles. My script works, but the problem is that BeautifulSoup and requests take a long time determining whether a page is an actual article or an "article not found" page. I have approximately 4,000 articles, and by my calculation the script will take days to complete.

for article_url in edit_article_list:
    article_edit_page = s.get(article_url, data=payload).text
    article_edit_soup = BeautifulSoup(article_edit_page, 'lxml')

    # Section: the <select> only exists on real article edit pages
    submenu = article_edit_soup.find("select", {"name": "ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"})
    if submenu is None:
        continue
    for thing in submenu.findAll("option", {"selected": "selected"}):
        f.write(thing.get_text(strip=True) + "\t")

The first if check determines whether the URL is good or bad. edit_article_list is built by:

for count in range(87418, 307725):
    edit_article_list.append(login_url + "AddEditArticle.aspx?ArticleID=" + str(count))

Right now my script checks for the bad and good URLs and then scrapes the content. Is there any way to get only the valid URLs of this pattern using requests while building the URL list?

To skip articles which don't exist, you need to disable redirects and check the status code:

for article_url in edit_article_list:
    # Don't follow the redirect to the "article not found" page
    r = requests.get(article_url, data=payload, allow_redirects=False)
    if r.status_code != 200:
        continue
    article_edit_page = r.text
    article_edit_soup = BeautifulSoup(article_edit_page, 'lxml')

    # Section: the <select> only exists on real article edit pages
    submenu = article_edit_soup.find("select", {"name": "ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"})
    if submenu is None:
        continue
    for thing in submenu.findAll("option", {"selected": "selected"}):
        f.write(thing.get_text(strip=True) + "\t")
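
Even with the redirect check, looping over roughly 220,000 candidate URLs one at a time will still be slow. A minimal sketch for running the checks concurrently with a thread pool (assuming the session s, payload, and edit_article_list from your code; note that sharing a requests.Session across threads is common practice but not officially guaranteed to be thread-safe):

from concurrent.futures import ThreadPoolExecutor

def check_url(url):
    # A 200 response without a redirect means the edit page really exists
    r = s.get(url, data=payload, allow_redirects=False)
    return url if r.status_code == 200 else None

# 20 workers is a guess; tune it so you don't hammer the server
with ThreadPoolExecutor(max_workers=20) as executor:
    valid_urls = [u for u in executor.map(check_url, edit_article_list) if u]

You can then run your existing scraping loop over valid_urls only.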

I do, though, recommend parsing the article list page for the actual URLs: you are currently firing off over 200,000 requests while only expecting about 4,000 articles, which is a lot of overhead and traffic, and not very efficient!
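
For example (a hypothetical sketch: I don't know your system's markup, so the ArticleList.aspx URL and the href pattern below are assumptions), you could harvest the real ArticleID links straight from the listing page:

# Hypothetical: adjust the listing URL and href pattern to your site
list_page = s.get(login_url + "ArticleList.aspx", data=payload).text
list_soup = BeautifulSoup(list_page, 'lxml')

edit_article_list = []
for a in list_soup.findAll("a", href=True):
    if "AddEditArticle.aspx?ArticleID=" in a["href"]:
        edit_article_list.append(login_url + a["href"])

That way every URL in edit_article_list is already known to exist, and the status-code check becomes a safety net rather than the main filter.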
