
How to crawl every page of a website with Python and BeautifulSoup

Is there any way to crawl every page under a URL?

For example, how can I find every article page on https://gogo.mn/?

The following is what I have so far. The problem is that the news article URLs follow no obvious pattern, for example https://gogo.mn/r/qqm4m

So code like the following will never find the articles.

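import requests
from bs4 import BeautifulSoup

base_url = 'https://gogo.mn/'
number_pages = 10  # placeholder; the real number of pages is unknown
for i in range(number_pages):
    url = base_url + str(i)  # builds https://gogo.mn/0, https://gogo.mn/1, ...
    req = requests.get(url)
    soup = BeautifulSoup(req.content, 'html.parser')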

How do I crawl such websites?

The easiest way is to first get the page from the website. This can be done as follows:

import requests

url = 'https://gogo.mn/'
response = requests.get(url)

The page is then held in the response variable, and you can inspect its HTML through response.text.

Now parse the page with BeautifulSoup and find all of the links it contains:

html = BeautifulSoup(response.text, 'html.parser')  # requires: from bs4 import BeautifulSoup
a_links = html.find_all('a')

This returns a bs4.element.ResultSet, which you can iterate over with a for loop. Looking at your particular site, I found that many of its links omit the base URL, so the URLs have to be normalized.

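for link in a_links:
    href = link.get('href')  # None when the <a> tag has no href attribute
    if not href:
        continue
    if href.startswith(('http://', 'https://')):
        print(href)  # already an absolute URL
    else:
        # relative link: drop any leading slash, since url already ends with '/'
        print(f"{url}{href.lstrip('/')}")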
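
A more robust way to do this normalization is urllib.parse.urljoin from the standard library, which also copes with links such as ../page or //host/page. The following is only a minimal sketch, not the code this answer was written with; the helper name normalize_links and the sample href values are made up for illustration.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Turn raw href values into absolute http(s) URLs without duplicates.

    normalize_links is a hypothetical helper, not part of requests or bs4.
    """
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # skip missing or empty href values
        absolute = urljoin(base_url, href.strip())  # resolve relative links
        absolute, _fragment = urldefrag(absolute)   # drop any "#section" part
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # ignore mailto:, javascript:, tel: and similar
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Made-up href values of the kinds a page might contain
sample_hrefs = ["/r/qqm4m", "r/qqm4m", "https://gogo.mn/", "#top", "mailto:a@b.c", None]
links = normalize_links("https://gogo.mn/", sample_hrefs)
print(links)
```

With the a_links result from above, the input list could be built as [a.get('href') for a in a_links].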

Once you've done that, you have all of the links from a given page. You then need to eliminate duplicates and, for each linked page, repeat the process on the links it contains. This means recursively stepping through every link you find, while keeping track of the pages already visited so that the crawl terminates.
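
One way to put these steps together is sketched below. It is an illustration under stated assumptions rather than a finished crawler: it uses a queue instead of literal recursion (so a large site cannot hit Python's recursion limit), it only follows links on the same host as the start page, and the names fetch_links, crawl, get_links and max_pages are invented for this example.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def fetch_links(url):
    """Download one page and return the absolute URLs of its <a href> links."""
    # Third-party imports live here so the crawl logic below can be tried
    # without network access or these packages installed.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]


def crawl(start_url, get_links=fetch_links, max_pages=100):
    """Breadth-first crawl of pages on the same host as start_url.

    max_pages is an assumed safety limit so the crawl always stops.
    """
    host = urlparse(start_url).netloc
    visited = set()
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url, _fragment = urldefrag(queue.popleft())
        if url in visited or urlparse(url).netloc != host:
            continue  # already seen, or a link to an external site
        visited.add(url)
        try:
            links = get_links(url)
        except Exception as error:  # one bad page should not stop the crawl
            print(f"skipping {url}: {error}")
            continue
        queue.extend(links)
    return visited
```

Calling crawl('https://gogo.mn/', max_pages=50) would then return the set of page URLs reached from the front page, up to the limit. Passing a different get_links function makes it possible to try the traversal on fake data first.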

Regards

I have not used Scrapy. But to get all the content using only requests and BeautifulSoup, you need to find the index page of the website (sometimes an archive or search-results page), save the URLs of all the pages, loop through the URLs, and save the content of each page.
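
The last two steps (loop through the URLs and save the content) might look like the sketch below. It assumes, based only on the single example https://gogo.mn/r/qqm4m from the question, that article paths look like /r/<id>. That pattern, the helper names article_urls and save_pages, and the output file naming are all hypothetical.

```python
import re
from pathlib import Path
from urllib.parse import urlparse

# Assumed article pattern, guessed from the one example URL in the question
ARTICLE_PATH = re.compile(r"/r/\w+")


def article_urls(urls):
    """Keep only the URLs whose path matches the assumed article pattern."""
    return sorted({u for u in urls if ARTICLE_PATH.fullmatch(urlparse(u).path)})


def save_pages(urls, out_dir, fetch):
    """Fetch each URL with fetch(url) -> str and write the text to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # "/r/qqm4m" becomes the file name "r_qqm4m.html"
        name = urlparse(url).path.strip("/").replace("/", "_") + ".html"
        target = out / name
        target.write_text(fetch(url), encoding="utf-8")
        saved.append(target)
    return saved
```

With requests, the fetch argument could be lambda u: requests.get(u, timeout=10).text, and the input to article_urls would be the links collected from the index page.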


 