
How to crawl every page in a website using BeautifulSoup

Is there any way to crawl every page under a URL?

For example, given https://gogo.mn/, how can I find every article page on the site?

The following is what I have so far:

from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://gogo.mn/"
urls = []

# Parse the start page and collect every same-site link found on it
soup = BeautifulSoup(urlopen(url).read(), "html.parser")
for tag in soup.find_all('a', href=True):
    href = urljoin(url, tag['href'])  # resolve relative links against the start URL
    if url in href and href not in urls:
        urls.append(href)

For some reason this code does not crawl through all the pages. How do I achieve that?

One way is to use Selenium WebDriver, which can handle pagination by clicking each page button and scraping the result.

Another way is with BeautifulSoup, which is what you are looking for. Here you need to understand the format of the page links. For example, if the main page is google.com/, page 1 is google.com/-1, page 2 is google.com/-2, and so on, then the base URL is google.com/-. In a loop, append each page number to the base URL, request the resulting URL, and repeat until the last page; that gives you the data from every page. See the code below for an example.

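import requests
from bs4 import BeautifulSoup

base_url = 'https://google.com/-'  # illustrative pattern only, not a real paginated URL
number_pages = 10  # placeholder: set this to the site's actual page count
for i in range(1, number_pages + 1):  # pages are numbered from 1 in this pattern
    url = base_url + str(i)
    req = requests.get(url)
    soup = BeautifulSoup(req.content, 'html.parser')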
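If the number of pages is not known in advance, one option is to keep requesting pages until one is missing. The following is only a sketch under assumptions: `iter_pages` and `fetch_with_requests` are hypothetical helper names, and it assumes the site answers with a non-200 status once you go past the last page, which not every site does.

```python
from typing import Callable, Iterator, Optional, Tuple


def iter_pages(
    base_url: str,
    fetch: Callable[[str], Optional[str]],
    max_pages: int = 1000,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, html) for base_url + "1", base_url + "2", ... until a page is missing."""
    for page_number in range(1, max_pages + 1):
        url = base_url + str(page_number)
        html = fetch(url)
        if html is None:  # assumption: the first missing page means we are past the last one
            break
        yield url, html


def fetch_with_requests(url: str) -> Optional[str]:
    """Return the page body, or None when the server does not answer 200 OK."""
    import requests  # third-party; imported here so iter_pages itself needs only the stdlib

    response = requests.get(url, timeout=10)
    return response.text if response.status_code == 200 else None
```

To use it, loop over `iter_pages(base_url, fetch_with_requests)` and pass each `html` to `BeautifulSoup(html, 'html.parser')`. The `max_pages` cap is there so the loop still ends on sites that return a valid page for every number.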

Note that the above is just an example. The overall idea is to understand the link pattern, build the link for every page accordingly, then loop over those links and get the data.
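When a site has no predictable page pattern, a different technique from the one in this answer is to follow links: the snippet in the question reads links from the start page only, so it never fetches the pages those links point to. Below is a minimal sketch of a breadth-first crawl with a visited set. The names `crawl` and `get_links_with_bs4` are hypothetical, "same site" is assumed to mean "same host as the start URL", and `max_pages` is an arbitrary safety cap.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 500,
) -> List[str]:
    """Breadth-first walk over same-host links; returns pages in visit order."""
    host = urlparse(start_url).netloc
    visited = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for href in get_links(page):
            # Resolve relative links and drop any #fragment
            link = urldefrag(urljoin(page, href)).url
            if urlparse(link).netloc == host and link not in visited:
                visited.add(link)
                queue.append(link)
    return order


def get_links_with_bs4(url: str) -> List[str]:
    """Fetch one page and return the raw href values of its <a> tags."""
    import requests  # third-party imports kept local so crawl() needs only the stdlib
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, "html.parser")
    return [tag["href"] for tag in soup.find_all("a", href=True)]
```

Calling `crawl("https://gogo.mn/", get_links_with_bs4, max_pages=50)` would visit up to 50 pages on that host. Passing the link extractor in as a function keeps the traversal separate from the fetching, so the visited-set logic can be checked against a small in-memory link map before pointing it at a real site.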


 