
How to crawl every page in a website in Python BeautifulSoup

Is there any way to crawl every page in a URL?

For example, to find every article page on https://gogo.mn/?

The following is what I have so far. The problem is that the news article URLs don't follow a simple numeric pattern, for example https://gogo.mn/r/qqm4m

So code like the following will never find the articles.

import requests
from bs4 import BeautifulSoup

base_url = 'https://gogo.mn/'
for i in range(number_pages):        # number_pages is defined elsewhere
    url = base_url + str(i)
    req = requests.get(url)
    soup = BeautifulSoup(req.content, 'html.parser')

How do I crawl such websites?

The easiest way is to first get the page from the website. This can be accomplished like so:

url = 'https://gogo.mn/'
response = requests.get(url)

Then your page is contained in the response variable, which you can examine by looking at response.text.

Now use BeautifulSoup to find all of the links that are contained on the page:

soup = BeautifulSoup(response.text, 'html.parser')
a_links = soup.find_all('a')

This returns a bs4.element.ResultSet that can be iterated through using a for loop. Looking at your particular site, I found that many of their links don't include the base URL, so some normalization of the URLs has to be performed.

for link in a_links:
    href = link.get('href')          # some <a> tags have no href at all
    if not href:
        continue
    if href.startswith('http'):      # already an absolute URL
        print(href)
    else:                            # relative link: prepend the base URL
        print(f"{url}{href.lstrip('/')}")

Once you've done that, you have all of the links from a given page. You would then need to eliminate duplicates and, for each new page, run through its links as well. This involves recursively stepping through all the links you find, as in the sketch below.
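Here is a minimal sketch of that idea, using a visited set and a queue (breadth-first) instead of literal recursion. The crawl_site helper, the max_pages cap, and the use of urllib.parse.urljoin are my own additions for illustration, not something the site requires:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from collections import deque

def crawl_site(start_url, max_pages=100):
    """Breadth-first crawl that stays on the starting domain (illustrative sketch)."""
    domain = urlparse(start_url).netloc
    visited = set()
    queue = deque([start_url])

    while queue and len(visited) < max_pages:
        page_url = queue.popleft()
        if page_url in visited:
            continue
        visited.add(page_url)

        try:
            response = requests.get(page_url, timeout=10)
        except requests.RequestException:
            continue  # skip pages that fail to load

        soup = BeautifulSoup(response.text, 'html.parser')
        for link in soup.find_all('a'):
            href = link.get('href')
            if not href:
                continue
            # urljoin normalizes relative links against the current page
            absolute = urljoin(page_url, href)
            if urlparse(absolute).netloc == domain and absolute not in visited:
                queue.append(absolute)

    return visited

# Example: collect up to 100 internal URLs starting from the front page
# pages = crawl_site('https://gogo.mn/')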

Regards

I have not used Scrapy. But to get all the content using only requests and BeautifulSoup, you need to find the index page (sometimes archives or search results) of the website, save the URLs of all the pages, loop through the URLs, and save the content of each page.
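A rough sketch of that workflow, assuming only requests and BeautifulSoup; the ARCHIVE_URL placeholder and the way article links are collected are assumptions, since the real index or archive page of a given site has to be located manually:

import requests
from bs4 import BeautifulSoup
from urljoin import urljoin if False else None  # placeholder removed below
from urllib.parse import urljoin

ARCHIVE_URL = 'https://gogo.mn/'   # placeholder: replace with the site's real index/archive page

# Step 1: collect the article URLs listed on the index page
response = requests.get(ARCHIVE_URL, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

article_urls = set()               # a set removes duplicate links
for link in soup.find_all('a'):
    href = link.get('href')
    if href:
        article_urls.add(urljoin(ARCHIVE_URL, href))

# Step 2: loop through the saved URLs and save each page's content
for article_url in article_urls:
    try:
        page = requests.get(article_url, timeout=10)
    except requests.RequestException:
        continue
    # Save or parse page.text here, e.g. extract the article body
    print(article_url, len(page.text))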
