
Python: using Selenium to send the spacebar to scroll and Beautiful Soup to parse the HTML


import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()  # assuming a Chrome driver; any WebDriver works
browser.get("https://steamcommunity.com/app/933110/reviews/?browsefilter=toprated&snr=1_5_100010_")
url = browser.current_url
page = requests.get(url).text  # fetched once, before any scrolling happens
soup = BeautifulSoup(page, 'html.parser')
actions = ActionChains(browser)

for y in range(50):
    for x in soup.find_all("div", {"class": "apphub_CardTextContent"}):
        print(x.text.strip() + "\n")
        actions.send_keys(Keys.SPACE).perform()
        time.sleep(1)

I am trying to get the text of every review of this game on the Steam store. There are over 2000 reviews. The function I attempted to write gets the reviews that are inside the part of the page that has already loaded, but the roughly 1900 other reviews are never printed. I need the code to scroll down and collect every review. I used 50 space presses as a test, but I am unsure how many it will take to reach the bottom of the page. The main problem is that when I print the HTML with prettify(), the HTML doesn't contain every review.

Try this code with requests and BS4:

import requests
from bs4 import BeautifulSoup

for i in range(0, 10):
    # userreviewsoffset advances by 10 each iteration; the remaining {1} placeholders all receive i + 1
    url = "https://steamcommunity.com/app/933110/homecontent/?userreviewscursor=AoIIPwcQI3XbjqgC&userreviewsoffset={0}&p={1}&workshopitemspage={1}&readytouseitemspage={1}&mtxitemspage={1}&itemspage={1}&screenshotspage={1}&videospage={1}&artpage={1}&allguidepage={1}&webguidepage={1}&integratedguidepage={1}&discussionspage={1}&numperpage={1}&browsefilter=toprated&browsefilter=toprated&appid=933110&appHubSubSection=10&appHubSubSection=10&l=english&filterLanguage=default&searchText=&forceanon=1".format(i*10, i+1)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, features="html.parser")
    for x in soup.find_all("div", {"class": "apphub_CardTextContent"}):
        print(x.text.strip() + "\n")

Change the range value accordingly; with range(0, 10), the posts from the first 10 pages are printed.
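If you do not know in advance how many pages there are, one option is to keep requesting pages until one comes back with no review cards. A rough sketch of that idea (the page cap of 200 and the 1-second delay are arbitrary choices for this sketch, not part of the answer above):

import time

import requests
from bs4 import BeautifulSoup

reviews = []
for i in range(0, 200):  # generous upper bound; the loop usually stops earlier
    url = "https://steamcommunity.com/app/933110/homecontent/?userreviewscursor=AoIIPwcQI3XbjqgC&userreviewsoffset={0}&p={1}&workshopitemspage={1}&readytouseitemspage={1}&mtxitemspage={1}&itemspage={1}&screenshotspage={1}&videospage={1}&artpage={1}&allguidepage={1}&webguidepage={1}&integratedguidepage={1}&discussionspage={1}&numperpage={1}&browsefilter=toprated&browsefilter=toprated&appid=933110&appHubSubSection=10&appHubSubSection=10&l=english&filterLanguage=default&searchText=&forceanon=1".format(i*10, i+1)
    soup = BeautifulSoup(requests.get(url).text, features="html.parser")
    cards = soup.find_all("div", {"class": "apphub_CardTextContent"})
    if not cards:
        break  # no more reviews came back, so assume we reached the end
    reviews.extend(card.text.strip() for card in cards)
    time.sleep(1)  # small delay so we do not hammer the server

print("collected", len(reviews), "reviews")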

Re: scrolling to the bottom of the page consistently, I've found it's easiest to use execute_script and let JavaScript do that for you. That would look like this in your example:

browser.execute_script("scroll(0, document.body.parentNode.scrollHeight)")

Re: not seeing all of the reviews, I think the problem is that you define soup before your loop. I believe this means that when you call soup.find_all() inside the loop, you only ever parse the initial page you obtained with page = requests.get(url).text. Since Selenium and Beautiful Soup aren't connected, you'll have to make a new request for every new section of reviews in order to load the response into soup and parse from there.
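For example, here is a minimal sketch that scrolls with execute_script and rebuilds soup from browser.page_source after every scroll; the 50-iteration cap and 2-second sleep are guesses that may need tuning for this page:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://steamcommunity.com/app/933110/reviews/?browsefilter=toprated&snr=1_5_100010_")

seen = set()
last_height = 0
for _ in range(50):  # upper bound on scroll attempts; adjust as needed
    # scroll to the bottom so the page loads the next batch of reviews
    browser.execute_script("scroll(0, document.body.parentNode.scrollHeight)")
    time.sleep(2)  # give the new reviews time to load

    # re-parse the freshly rendered HTML on every pass
    soup = BeautifulSoup(browser.page_source, "html.parser")
    for card in soup.find_all("div", {"class": "apphub_CardTextContent"}):
        text = card.text.strip()
        if text not in seen:
            seen.add(text)
            print(text + "\n")

    # stop once scrolling no longer increases the page height
    new_height = browser.execute_script("return document.body.parentNode.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

browser.quit()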

If you want to go the Selenium route, you could do the scrolling and get the text from the elements using Selenium's own methods: browser.find_elements_by_class_name("apphub_CardTextContent"), and then read the text while looping over the elements with element.get_attribute("innerText").
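A short sketch of that Selenium-only approach, using the Selenium 3 style method name from the answer above (newer Selenium versions use browser.find_elements(By.CLASS_NAME, ...) instead); the 50 key presses are an arbitrary number borrowed from the question:

import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()
browser.get("https://steamcommunity.com/app/933110/reviews/?browsefilter=toprated&snr=1_5_100010_")

for _ in range(50):
    # build a fresh chain each time so only one space is sent per iteration
    ActionChains(browser).send_keys(Keys.SPACE).perform()
    time.sleep(1)

# read the text straight from the rendered elements, no extra parsing step
for card in browser.find_elements_by_class_name("apphub_CardTextContent"):
    print(card.get_attribute("innerText").strip() + "\n")

browser.quit()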

I would lean more towards just using requests to get the HTML documents and parsing with Beautiful Soup, just because that means less DOM interaction, which usually means fewer bugs (in my experience). However, from looking at the site, it seems like you might have to dig in to replicating their requests for the HTML of the extra reviews.
