
How to scrape hidden class data using Selenium and Beautiful Soup

I'm trying to scrape the content of a JavaScript-enabled web page. I need to extract the data in a table on that website. However, each row of the table has a button (an arrow) that reveals additional information about that row.

I need to extract that additional description for each row. On inspection, the content behind each row's arrow belongs to the same class. However, that class does not appear in the page source; it can be observed only while inspecting. The data I'm trying to scrape comes from the web page loaded in the code below.

I have used Selenium and Beautiful Soup. I'm able to scrape the table data, but not the content behind the arrows in the table. My Python code returns an empty list for the class of that arrow, although it works for the classes of the normal table data.

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://projects.sfchronicle.com/2020/layoff-tracker/')
html_source = browser.page_source  
soup = BeautifulSoup(html_source,'html.parser')
# find_all returns a list (ResultSet), which has no .text attribute, so print the list itself
data = soup.find_all('div',class_="sc-fzoLsD jxXBhc rdt_ExpanderRow")
print(data)

The content you are interested in is generated when you click a button, so you need to locate the button first. There are many ways to do this, but I would suggest something like:

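from selenium.webdriver.common.by import By

elements = driver.find_elements(By.XPATH, '//button')

For your specific case you could also use:

elements = driver.find_elements(By.CSS_SELECTOR, 'button[class|="sc"]')

find_elements returns a list, so once you have the button elements, click each one in turn (driver here is the WebDriver instance, called browser in the question):

for element in elements:
    element.click()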

Parsing the page source after the clicks should give you the JavaScript-generated content you are looking for.
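
Putting the steps of this answer together, a minimal sketch might look like the following. The names click_all and main are hypothetical, not part of Selenium. The sketch assumes that every matched button is a row expander, that clicking one button does not invalidate the others, and that the expanded content is in the DOM immediately after the click; if any of that fails on the real page, an explicit wait or re-locating the buttons would be needed. It searches for rdt_ExpanderRow only, since the sc-... class names from the question look auto-generated and may change.

```python
def click_all(driver, by, selector):
    """Click every element matching the locator and return how many were clicked."""
    clicked = 0
    for button in driver.find_elements(by, selector):
        button.click()
        clicked += 1
    return clicked


def main():
    # Imported here so click_all can be reused or tested without a browser installed.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get('https://projects.sfchronicle.com/2020/layoff-tracker/')
        # Selector taken from the answer above; it matches buttons whose class starts with "sc-".
        click_all(driver, By.CSS_SELECTOR, 'button[class|="sc"]')
        # Parse only after the clicks, so the expanded rows are part of the source.
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for row in soup.select('div.rdt_ExpanderRow'):
            print(row.get_text(' ', strip=True))
    finally:
        driver.quit()
```

Calling main() opens Firefox, expands the rows, and prints the text of each expanded row, provided the assumptions above hold.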

To print the hidden data, you can use this example:

import re
import json
import requests
from bs4 import BeautifulSoup


url = 'https://projects.sfchronicle.com/2020/layoff-tracker/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
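# the data is read from the "commons-..." JavaScript bundle linked from the page
data_url = 'https://projects.sfchronicle.com' + soup.select_one('link[href*="commons-"]')['href']

# take the second JSON.parse('...') payload in the bundle and undo the JavaScript \' escapes
data = re.findall(r'n\.exports=JSON\.parse\(\'(.*?)\'\)', requests.get(data_url).text)[1]
data = json.loads(data.replace(r"\'", "'"))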

# uncomment this to see all data:
# print(json.dumps(data, indent=4))

for d in data[4:]:
    print('{:<50}{:<10}{:<30}{:<30}{:<30}{:<30}{:<30}'.format(*d.values()))

Prints:

Company                                           Layoffs   City                          County                        Month                         Industry                      Company description           
Tesla (Temporary layoffs. Factory reopened)       11083     Fremont                       Alameda County                April                         Industrial                    Car maker                     
Bon Appetit Management Co.                        3015      San Francisco                 San Francisco County          April                         Food                          Food supplier                 
GSW Arena LLC-Chase Center                        1720      San Francisco                 San Francisco County          May                           Sports                        Arena vendors                 
YMCA of Silicon Valley                            1657      Santa Clara                   Santa Clara County            May                           Sports                        Gym                           
Nutanix Inc. (Temporary furlough of 2 weeks)      1434      San Jose                      Santa Clara County            April                         Tech                          Cloud computing               
TeamSanJose                                       1304      San Jose                      Santa Clara County            April                         Travel                        Tourism bureau                
San Francisco Giants                              1200      San Francisco                 San Francisco County          April                         Sports                        Stadium vendors               
Lyft                                              982       San Francisco                 San Francisco County          April                         Tech                          Ride hailing                  
YMCA of San Francisco                             959       San Francisco                 San Francisco County          May                           Sports                        Gym                           
Hilton San Francisco Union Square                 923       San Francisco                 San Francisco County          April                         Travel                        Hotel                         
Six Flags Discovery Kingdom                       911       Vallejo                       Solano County                 June                          Entertainment                 Amusement park                
San Francisco Marriott Marquis                    808       San Francisco                 San Francisco County          April                         Travel                        Hotel                         
Aramark                                           777       Oakland                       Alameda County                April                         Food                          Food supplier                 
The Palace Hotel                                  774       San Francisco                 San Francisco County          April                         Travel                        Hotel                         
Back of the House Inc                             743       San Francisco                 San Francisco County          April                         Food                          Restaurant                    
DPR Construction                                  715       Redwood City                  San Mateo County              April                         Real estate                   Construction                  

...and so on.
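
To see what the regular expression and the replace call are doing without hitting the network, here is a small offline sketch. The sample_js string and the extract_payloads helper are made up for illustration; the real bundle is much larger and may contain other JavaScript escapes that this does not handle. Unlike the code above, the helper decodes every payload it finds rather than only the second one.

```python
import json
import re

# Same pattern as in the answer: capture the argument of each JSON.parse('...') call.
PAYLOAD_RE = re.compile(r"n\.exports=JSON\.parse\('(.*?)'\)")


def extract_payloads(js_text):
    """Return every JSON.parse('...') payload in js_text, decoded into Python objects."""
    payloads = []
    for raw in PAYLOAD_RE.findall(js_text):
        # \' is a valid escape in a JavaScript string but not in JSON, so undo it first.
        payloads.append(json.loads(raw.replace(r"\'", "'")))
    return payloads


# Made-up stand-in for the bundle, with two embedded payloads.
sample_js = r"""n.exports=JSON.parse('[{"a":1}]')},function(e,t){n.exports=JSON.parse('[{"Company":"Joe\'s Diner","Layoffs":"12"}]')}"""

# Index 1 picks the second payload, mirroring the [1] in the answer.
records = extract_payloads(sample_js)[1]
for record in records:
    print('{:<20}{:<10}'.format(*record.values()))
```

Without the replace call, json.loads would reject the \' sequence, because JSON only allows a fixed set of backslash escapes.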
