
Finding number of pages using Python BeautifulSoup

I want to extract the total number of pages (11 in this case) from a Steam page. I believe the following code should work (return 11), but it is returning an empty list, as if it is not finding the paged_items_paging_pagelink class.

import requests
import re
from bs4 import BeautifulSoup
r = requests.get('http://store.steampowered.com/tags/en-us/RPG/')
c = r.content
soup = BeautifulSoup(c, 'html.parser')


total_pages = soup.find_all("span",{"class":"paged_items_paging_pagelink"})[-1].text

If you check the page source, the content you want is not there. This means it is generated dynamically through JavaScript.

The page numbers are located inside the <span id="NewReleases_links"> tag, but in the page source the HTML shows only this:

<span id="NewReleases_links"></span>
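A quick way to check this kind of thing without opening a browser is to test whether the element has any content in the raw HTML. The sketch below is only an illustration using the standard library; the names SpanContentProbe and is_filled_in_static_html are hypothetical, and in practice the html argument would be r.text from the request.

```python
from html.parser import HTMLParser

class SpanContentProbe(HTMLParser):
    """Records whether the element with the given id has child tags or text."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0            # > 0 while the parser is inside the target element
        self.found = False
        self.has_content = False

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Any tag opened inside the target counts as content.
            self.depth += 1
            self.has_content = True
        elif dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Non-whitespace text inside the target also counts as content.
        if self.depth and data.strip():
            self.has_content = True

def is_filled_in_static_html(html, target_id):
    probe = SpanContentProbe(target_id)
    probe.feed(html)
    return probe.found and probe.has_content

print(is_filled_in_static_html('<span id="NewReleases_links"></span>', "NewReleases_links"))
```

If this returns False for the downloaded page while the browser shows page links, the links are being added by JavaScript after the page loads.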

The easiest way to handle this is to use Selenium.

But, if you look at the page source, the text Showing 1-20 of 213 results is available. So, you can scrape this and calculate the number of pages.

Required HTML:

<div class="paged_items_paging_summary ellipsis">
    Showing 
    <span id="NewReleases_start">1</span>
    -
    <span id="NewReleases_end">20</span> 
    of 
    <span id="NewReleases_total">213</span> 
    results         
</div>

Code:

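import math

import requests
from bs4 import BeautifulSoup

r = requests.get('http://store.steampowered.com/tags/en-us/RPG/')
soup = BeautifulSoup(r.text, 'lxml')

def get_pages_no(soup):
    # Total number of results, e.g. 213
    total_items = int(soup.find('span', id='NewReleases_total').text)
    # On the first page, the index of the last item shown equals the page size, e.g. 20
    items_per_page = int(soup.find('span', id='NewReleases_end').text)
    # Round up: a partially filled last page is still a page
    return math.ceil(total_items / items_per_page)

print(get_pages_no(soup))
# prints 11 for the 213 results shown above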
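One caveat on the page-count arithmetic: a partially filled last page still counts as a page, so the division has to round up rather than round to nearest (201 results at 20 per page need 11 pages, not 10). Below is a minimal sketch of that calculation working directly on the summary text. The helper names parse_summary and page_count are hypothetical, and it assumes the summary is taken from a full page such as the first one.

```python
import math
import re

def parse_summary(text):
    """Extract (start, end, total) from text like 'Showing 1-20 of 213 results'."""
    # Collapse whitespace first, because the HTML spreads the numbers over several lines.
    flat = " ".join(text.split())
    match = re.search(r"Showing\s*(\d+)\s*-\s*(\d+)\s*of\s*(\d+)", flat)
    if match is None:
        raise ValueError("summary text not recognised: %r" % flat)
    start, end, total = (int(group) for group in match.groups())
    return start, end, total

def page_count(start, end, total):
    """Number of pages, counting a partially filled last page as a page."""
    per_page = end - start + 1   # assumes the page being shown is a full one
    return math.ceil(total / per_page)

print(page_count(*parse_summary("Showing 1-20 of 213 results")))
```

With BeautifulSoup, the text argument could come from the get_text() of the paged_items_paging_summary div shown earlier.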

(Note: I still recommend using Selenium, as most of the content on this site is dynamically generated. It will be a pain to scrape all the data like this.)
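For reference, the Selenium route could look roughly like the sketch below. It is untested against the live site and rests on several assumptions: Chrome and a matching driver are installed, the pagination links are still span elements with the paged_items_paging_pagelink class from the question, and the function names are made up for illustration.

```python
def last_page_number(labels):
    """Return the largest integer among the pagination link labels, or None."""
    stripped = (label.strip() for label in labels)
    numbers = [int(label) for label in stripped if label.isdigit()]
    return max(numbers) if numbers else None

def fetch_total_pages(url, timeout=10):
    # Imported lazily so that last_page_number() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    selector = "span.paged_items_paging_pagelink"
    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        # Wait until JavaScript has inserted at least one pagination link.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        links = driver.find_elements(By.CSS_SELECTOR, selector)
        return last_page_number(link.text for link in links)
    finally:
        driver.quit()
```

Calling fetch_total_pages('http://store.steampowered.com/tags/en-us/RPG/') would then play the role of the find_all(...)[-1].text line in the question, with the browser doing the JavaScript rendering first.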

An alternative, faster way without using BeautifulSoup:

import math

import requests

# This endpoint returns the query result in JSON format
url = "http://store.steampowered.com/contenthub/querypaginated/tags/NewReleases/render/?query=&start=20&count=20&cc=US&l=english&no_violence=0&no_sex=0&v=4&tag=RPG"
r = requests.get(url)

# total_count = number of records, 20 = number of records per page (count=20)
print(math.ceil(r.json()['total_count'] / 20))

11
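The same URL carries start and count query parameters, so it can presumably be used to walk through every page as well. The sketch below only builds the requests and does not touch the network. It assumes start is a zero-based record offset, which the URL suggests but the answer does not confirm, and the names BASE_URL, build_params and page_offsets are hypothetical.

```python
import math

BASE_URL = "http://store.steampowered.com/contenthub/querypaginated/tags/NewReleases/render/"

def build_params(start, count=20, tag="RPG"):
    """Query parameters copied from the URL above; only start, count and tag vary."""
    return {
        "query": "", "start": start, "count": count, "cc": "US", "l": "english",
        "no_violence": 0, "no_sex": 0, "v": 4, "tag": tag,
    }

def page_offsets(total_count, count=20):
    """Assumed start offsets (0, count, 2 * count, ...) covering total_count records."""
    return list(range(0, total_count, count))

offsets = page_offsets(213)
print(len(offsets), offsets[-1])
```

Each offset could then be fetched with requests.get(BASE_URL, params=build_params(offset)), and the number of offsets equals math.ceil(total_count / count).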


 