How to scrape hidden class data using Selenium and Beautiful Soup
I'm trying to scrape content from a JavaScript-enabled web page. I need to extract the data in the table on that website. However, each row of the table has a button (an arrow) that reveals additional information about that row.
I need to extract that additional description for each row. By inspecting the page, I can see that the expanded content of each row's arrow belongs to the same class, but that class does not appear in the page source; it can be observed only in the browser inspector. The data I'm trying to parse is from the webpage https://projects.sfchronicle.com/2020/layoff-tracker/.
I have used Selenium and Beautiful Soup. I'm able to scrape the table data, but not the content behind those arrows. My Python code returns an empty list for the arrow's class, though it works for the classes of the normal table cells.
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://projects.sfchronicle.com/2020/layoff-tracker/')
html_source = browser.page_source
soup = BeautifulSoup(html_source, 'html.parser')
data = soup.find_all('div', class_="sc-fzoLsD jxXBhc rdt_ExpanderRow")
# `data` comes back as an empty list, so this loop prints nothing
for row in data:
    print(row.text)
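The empty list happens because the expander rows are injected by JavaScript only after a click, so they are simply absent from the initial page source. A toy illustration with hypothetical markup (not the site's real HTML):

```python
from bs4 import BeautifulSoup

# Static HTML as the server delivers it: the table row is present,
# but the expander row does not exist in the DOM yet
static_html = '<div class="rdt_TableRow">Tesla 11083 Fremont</div>'
soup = BeautifulSoup(static_html, 'html.parser')

print(soup.find_all('div', class_='rdt_ExpanderRow'))       # []
print(soup.find_all('div', class_='rdt_TableRow')[0].text)  # Tesla 11083 Fremont
```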
The content you are interested in is generated when you click a button, so you first want to locate the buttons. There are a million ways you could do this, but I would suggest something like:
from selenium.webdriver.common.by import By

elements = driver.find_elements(By.XPATH, '//button')
For your specific case you could also use:
elements = driver.find_elements(By.CSS_SELECTOR, 'button[class|="sc"]')
Note that find_elements returns a list, so once you have the button elements, you can click each one:
for element in elements:
    element.click()
Parsing the page after this should get you the JavaScript-generated content you are looking for.
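Once the buttons have been clicked and the page re-parsed, the original find_all call starts matching. A minimal sketch of the parsing step on toy HTML (the markup here is an assumption modeled on the class names in the question, not the live site):

```python
from bs4 import BeautifulSoup

# Toy HTML mimicking the table after the expander buttons were clicked
html = '''
<div class="rdt_TableRow">Tesla 11083 Fremont</div>
<div class="sc-fzoLsD jxXBhc rdt_ExpanderRow">Car maker</div>
<div class="rdt_TableRow">Lyft 982 San Francisco</div>
<div class="sc-fzoLsD jxXBhc rdt_ExpanderRow">Ride hailing</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Matching a single class is more robust than the full generated
# class string, whose "sc-..." parts can change between site builds
rows = [div.get_text(strip=True)
        for div in soup.find_all('div', class_='rdt_ExpanderRow')]
print(rows)  # ['Car maker', 'Ride hailing']
```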
To print the hidden data, you can use this example:
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://projects.sfchronicle.com/2020/layoff-tracker/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data_url = 'https://projects.sfchronicle.com' + soup.select_one('link[href*="commons-"]')['href']
data = re.findall(r'n\.exports=JSON\.parse\(\'(.*?)\'\)', requests.get(data_url).text)[1]
data = json.loads(data.replace(r"\'", "'"))
# uncomment this to see all data:
# print(json.dumps(data, indent=4))
for d in data[4:]:
    print('{:<50}{:<10}{:<30}{:<30}{:<30}{:<30}{:<30}'.format(*d.values()))
Prints:
Company Layoffs City County Month Industry Company description
Tesla (Temporary layoffs. Factory reopened) 11083 Fremont Alameda County April Industrial Car maker
Bon Appetit Management Co. 3015 San Francisco San Francisco County April Food Food supplier
GSW Arena LLC-Chase Center 1720 San Francisco San Francisco County May Sports Arena vendors
YMCA of Silicon Valley 1657 Santa Clara Santa Clara County May Sports Gym
Nutanix Inc. (Temporary furlough of 2 weeks) 1434 San Jose Santa Clara County April Tech Cloud computing
TeamSanJose 1304 San Jose Santa Clara County April Travel Tourism bureau
San Francisco Giants 1200 San Francisco San Francisco County April Sports Stadium vendors
Lyft 982 San Francisco San Francisco County April Tech Ride hailing
YMCA of San Francisco 959 San Francisco San Francisco County May Sports Gym
Hilton San Francisco Union Square 923 San Francisco San Francisco County April Travel Hotel
Six Flags Discovery Kingdom 911 Vallejo Solano County June Entertainment Amusement park
San Francisco Marriott Marquis 808 San Francisco San Francisco County April Travel Hotel
Aramark 777 Oakland Alameda County April Food Food supplier
The Palace Hotel 774 San Francisco San Francisco County April Travel Hotel
Back of the House Inc 743 San Francisco San Francisco County April Food Restaurant
DPR Construction 715 Redwood City San Mateo County April Real estate Construction
...and so on.
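Since each parsed row is a plain dict, ordinary Python works for further analysis. A hedged sketch of summing layoffs per county, using a small hardcoded sample (the key names here are assumptions inferred from the printed header, not the site's real JSON):

```python
from collections import defaultdict

# Small sample shaped like the rows printed above (hypothetical keys)
rows = [
    {'company': 'Tesla', 'layoffs': '11083', 'county': 'Alameda County'},
    {'company': 'Lyft', 'layoffs': '982', 'county': 'San Francisco County'},
    {'company': 'Aramark', 'layoffs': '777', 'county': 'Alameda County'},
]

# Accumulate layoff counts by county
totals = defaultdict(int)
for row in rows:
    totals[row['county']] += int(row['layoffs'])

print(dict(totals))
# {'Alameda County': 11860, 'San Francisco County': 982}
```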