
How to crawl every page in a website using BeautifulSoup

Is there any way to crawl every page under a given URL?

For example, starting from https://gogo.mn/, how can I find every article page on the site?

The following is what I have so far

import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

url = "https://gogo.mn/"
urls = []

# Fetch the start page and parse it
soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
for tag in soup.findAll('a', href=True):
    # Resolve relative links against the base URL
    tag['href'] = urllib.parse.urljoin(url, tag['href'])
    # Keep only links on the same site that have not been collected yet
    if url in tag['href'] and tag['href'] not in urls:
        urls.append(tag['href'])

For some reason this code does not crawl through all the pages. How do I achieve that?

One way is to use the Selenium web driver, which handles pagination by clicking the page button and scraping each page as it loads.
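A minimal sketch of that idea might look like the following; the start URL, the same-site filter, and the a.next selector for the "next page" button are assumptions that must be adapted to the target site.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()              # any installed WebDriver works
driver.get("https://gogo.mn/")           # start page from the question

article_links = set()
while True:
    # Collect links on the current page; the same-site check is an assumption
    for a in driver.find_elements(By.TAG_NAME, "a"):
        href = a.get_attribute("href")
        if href and href.startswith("https://gogo.mn/"):
            article_links.add(href)
    # Try to move to the next page; stop when there is no "next" button any more
    try:
        driver.find_element(By.CSS_SELECTOR, "a.next").click()
    except NoSuchElementException:
        break

driver.quit()
print(len(article_links), "links collected")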

Another way is with BeautifulSoup, which is what you are asking about. Here you need to understand the format of the page links: if the main page is google.com/, page 1 is google.com/-1, page 2 is google.com/-2, and so on, then the base URL is google.com/-. Then, in a loop, append the page number to the base URL and request each concatenated URL until the last page; that way you get data from every page. Refer to the code below for an example.

import requests
from bs4 import BeautifulSoup

base_url = 'google.com/-'                    # illustrative base URL only
number_pages = 10                            # set this to the real page count
for i in range(number_pages):
    url = base_url + str(i)                  # e.g. google.com/-0, google.com/-1, ...
    req = requests.get(url)
    soup = BeautifulSoup(req.content, 'html.parser')

Note that the above is just an example. The overall theme is to understand the link pattern, generate the link for every page accordingly, and then loop over those links and get the data.
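To make that theme concrete, here is a short sketch that loops over a hypothetical paginated listing and collects the article links found on each page; the example.com URL pattern, the page count, and the same-site filter are all assumptions, not the real pattern for gogo.mn.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical pattern: https://example.com/articles?page=1, ?page=2, ...
base_url = 'https://example.com/articles?page='
number_pages = 5                       # assumed number of listing pages
article_links = set()

for i in range(1, number_pages + 1):
    resp = requests.get(base_url + str(i))
    soup = BeautifulSoup(resp.content, 'html.parser')
    for tag in soup.find_all('a', href=True):
        link = urljoin(base_url, tag['href'])
        # Keep only links that stay on the same site (filter is an assumption)
        if link.startswith('https://example.com/'):
            article_links.add(link)

print(len(article_links), 'article links collected')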
