Finding the number of pages with Python BeautifulSoup

Question

I want to extract the total number of pages (11 in this case) from a Steam page. I expected the following code to return 11, but the lookup comes back as an empty list, as if the paged_items_paging_pagelink class were not in the page at all.

import requests
import re
from bs4 import BeautifulSoup
r = requests.get('http://store.steampowered.com/tags/en-us/RPG/')
c = r.content
soup = BeautifulSoup(c, 'html.parser')


total_pages = soup.find_all("span",{"class":"paged_items_paging_pagelink"})[-1].text

Answer 1

If you check the page source, the content you want is not there. That means it is generated dynamically by JavaScript.

The page numbers are located inside the <span id="NewReleases_links"> tag, but in the page source the HTML shows only this:

<span id="NewReleases_links"></span>
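
To see what this means for the question's code, the sketch below parses that empty span with BeautifulSoup, without any network access. The html string is copied from the snippet above; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# The static HTML as served, before any JavaScript has run
html = '<span id="NewReleases_links"></span>'
soup = BeautifulSoup(html, 'html.parser')

# The container exists, but it has no children yet
container = soup.find('span', id='NewReleases_links')
print(container.contents)

# So the selector from the question matches nothing
links = soup.find_all('span', {'class': 'paged_items_paging_pagelink'})
print(links)
```

Both calls yield empty results, so indexing the result with [-1] would raise an IndexError rather than return the last page link.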

The easiest way to handle this is to use Selenium, which drives a real browser so that the JavaScript runs before you read the page.

However, the text "Showing 1-20 of 213 results" is present in the page source, so you can scrape that and calculate the number of pages from it.

Required HTML:

<div class="paged_items_paging_summary ellipsis">
    Showing 
    <span id="NewReleases_start">1</span>
    -
    <span id="NewReleases_end">20</span> 
    of 
    <span id="NewReleases_total">213</span> 
    results         
</div>

Code:

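import math
import requests
from bs4 import BeautifulSoup

r = requests.get('http://store.steampowered.com/tags/en-us/RPG/')
soup = BeautifulSoup(r.text, 'lxml')

def get_pages_no(soup):
    # Total number of results, e.g. 213
    total_items = int(soup.find('span', id='NewReleases_total').text)
    # On the first page, the index of the last item shown equals the page size, e.g. 20
    items_per_page = int(soup.find('span', id='NewReleases_end').text)
    # Round up so that a partly filled last page is still counted
    return math.ceil(total_items / items_per_page)

print(get_pages_no(soup))
# prints 11 when the page reports 213 results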
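
The page count is the total number of results divided by the page size, rounded up. Rounding to the nearest integer gives the same answer for 213 results, but it undercounts whenever the last page is less than half full. A minimal sketch of the arithmetic, where page_count is a hypothetical helper and 201 is a made-up total chosen to show the difference:

```python
import math

def page_count(total_items, items_per_page):
    """Number of pages needed to show total_items (hypothetical helper)."""
    return math.ceil(total_items / items_per_page)

# 213 results at 20 per page: 10 full pages plus one partial page
print(page_count(213, 20))

# 201 results (a made-up total): the 11th page holds a single item
print(page_count(201, 20))
print(round(201 / 20))  # rounding to nearest gives 10 and drops that page
```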

(Note: I still recommend using Selenium, as most of the content on this site is generated dynamically. Scraping all of the data this way would be a pain.)

Answer 2

An alternative, faster way that does not use BeautifulSoup:

import math
import requests

# This endpoint returns the query results in JSON format
url = "http://store.steampowered.com/contenthub/querypaginated/tags/NewReleases/render/?query=&start=20&count=20&cc=US&l=english&no_violence=0&no_sex=0&v=4&tag=RPG"
r = requests.get(url)

# total_count = number of records, 20 = number of records per page (the count parameter)
# Round up so that a partly filled last page is still counted
print(math.ceil(r.json()['total_count'] / 20))

11
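
The start and count parameters in that URL suggest the endpoint is paginated. As a sketch only, assuming start is the zero-based offset of the first record and count is the page size (inferred from the URL, not confirmed by any documentation), the query string can be built from a dict and the page count derived from total_count. The helper names and the sample payload are hypothetical; 213 is the total quoted in Answer 1.

```python
import math
from urllib.parse import urlencode

BASE_URL = ("http://store.steampowered.com/contenthub/querypaginated/"
            "tags/NewReleases/render/")

def build_url(tag, page, count=20):
    """Build the query URL for a zero-based page index (assumed semantics)."""
    params = {
        'query': '', 'start': page * count, 'count': count,
        'cc': 'US', 'l': 'english',
        'no_violence': 0, 'no_sex': 0, 'v': 4, 'tag': tag,
    }
    return BASE_URL + '?' + urlencode(params)

def total_pages(payload, count=20):
    """Derive the page count from a decoded JSON response."""
    return math.ceil(payload['total_count'] / count)

# Page index 1 reproduces the URL used above (start=20)
print(build_url('RPG', 1))

# Stand-in for r.json(); only the total_count key is taken from the answer
sample_payload = {'total_count': 213}
print(total_pages(sample_payload))
```

Keeping the parameters in a dict makes it easy to step through pages by changing only the page index, instead of editing the query string by hand.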

Answer 1
2 Accepted 2018-02-28 17:34:54

Answer 2
2 2018-02-28 17:39:33
