Scrape text that was written by JavaScript from a website

I am using BeautifulSoup to scrape character info from a website. When I try to get the win rate of a character, BeautifulSoup can't find it.

When I inspect the text in the browser, it is listed under an element that does not appear in the site's source code. All I can find there, and all that BeautifulSoup finds, is "ranking-stats-placeholder".

This is the code I am currently using.

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "https://u.gg/lol/champions/darius/build/?role=top"

#opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")

#champion name
champ_name = page_soup.findAll("span", {"class":"champion-name"})[0].text

#champion win rate
champ_wr = page_soup.findAll("div", {"class":"win-rate okay-tier"})

I believe that the win rate text is added by JavaScript, but I have no idea how I can get the text. The code I currently have returns "None" for champ_wr.
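One way to check that suspicion is to list the class names present in the raw HTML, before any JavaScript has run. The following is a minimal sketch using only the standard library; `ClassCollector` and `classes_in_html` are names made up for illustration, and the sample markup is invented rather than copied from the site.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a class attribute may
        # hold several space-separated class names.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in_html(html_text):
    """Return the set of class names found in an HTML string."""
    collector = ClassCollector()
    collector.feed(html_text)
    collector.close()
    return collector.classes


# Invented sample standing in for what the server sends before JavaScript runs.
static_html = (
    '<span class="champion-name">Darius</span>'
    '<div class="ranking-stats-placeholder"></div>'
)

found = classes_in_html(static_html)
print("win-rate" in found)                   # False: not in the static markup
print("ranking-stats-placeholder" in found)  # True: only the placeholder is there
```

To run the same check against the real page, pass `page_html.decode("utf-8")` from the code above to `classes_in_html`. If the class shows up in the browser inspector but not in this set, the element is being created client-side.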

Although this text technically could be in the JavaScript itself, my first guess is that the JS is pulling it in via an AJAX request. Have your program simulate that request and you'll probably get all the data you need handed right to you, without any scraping involved!

It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any and all XMLHttpRequests.
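Once the network log shows which request carries the stats, replaying it could look roughly like the sketch below. Everything specific here is an assumption: `STATS_URL` is a placeholder rather than the site's real endpoint, and the `wins` and `matches` fields are hypothetical, so substitute whatever URL and JSON structure the network log actually shows.

```python
import json
from urllib.request import Request, urlopen

# Placeholder only: replace with the request URL seen in the network log.
STATS_URL = "https://example.com/api/champion-stats?champion=darius&role=top"


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the response body as JSON."""
    request = Request(
        url,
        headers={
            # Some servers reject requests that lack a browser-like User-Agent.
            "User-Agent": "Mozilla/5.0",
            "Accept": "application/json",
        },
    )
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def win_rate_percent(stats):
    """Compute a win rate from the hypothetical 'wins' and 'matches' fields."""
    return 100.0 * stats["wins"] / stats["matches"]
```

With a real endpoint in place, `win_rate_percent(fetch_json(STATS_URL))` would give the number directly. If the response already contains a ready-made win rate, the second function is unnecessary and you can read the field as-is.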

Best of luck!

Not sure how tied to BeautifulSoup you are, but I can get Selenium doing useful things with:

# load code from selenium package
from selenium.webdriver import Remote
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# start an instance of Chrome up
chrome = Service('/usr/local/bin/chromedriver')
chrome.start()
driver = Remote(chrome.service_url)

# get the page loading
driver.get("https://u.gg/lol/champions/darius/build/?role=top")

# wait for the win rate to be populated
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "win-rate")))

# get the values you wanted
# (find_element with By replaces the find_element_by_* helpers removed in Selenium 4.3)
name = driver.find_element(By.CLASS_NAME, "champion-name").text
winrate = driver.find_element(By.CLASS_NAME, "win-rate").text

# display them
print(f"name: {repr(name)}, winrate: {winrate.split()[0]}")

# clean up a bit
driver.quit()
