How to scrape hidden class data using Selenium and Beautiful Soup
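I am trying to scrape the content of a JavaScript-enabled web page. I need to extract the data from the table on that site. However, each row of the table has a button (an arrow) that reveals additional information about that row.
I need to extract that additional description for every row. Inspecting the page shows that the content behind each row's arrow belongs to the same class, but that class does not appear in the page source. It is only visible in the inspector. The data I am trying to parse comes from the web page below.
I have used Selenium and Beautiful Soup. I can scrape the table data, but not the content behind the arrows. My Python code returns an empty list for the class of the arrow content, although the same approach works for the classes of the normal table data.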
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://projects.sfchronicle.com/2020/layoff-tracker/')
html_source = browser.page_source
soup = BeautifulSoup(html_source,'html.parser')
data = soup.find_all('div',class_="sc-fzoLsD jxXBhc rdt_ExpanderRow")
print(data)  # find_all() returns a list-like ResultSet, which has no .text attribute
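The content you are interested in is generated only when you click a button, so you first need to locate that button. There are many ways to do this, but I would suggest something like the following (browser is the WebDriver instance from the question):
from selenium.webdriver.common.by import By
buttons = browser.find_elements(By.XPATH, '//button')
For your specific case, you could also use:
buttons = browser.find_elements(By.CSS_SELECTOR, 'button[class|="sc"]')
Once you have the button elements, click each one. Note that find_elements returns a list, so there is no single element to call click() on:
for button in buttons:
    button.click()
Parsing the page after this should give you the JavaScript-generated content you are looking for.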
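Putting the steps above together, here is a minimal sketch rather than a tested solution for this particular site. It assumes that the 'button[class|="sc"]' selector matches only the row arrows, that several rows can stay expanded at once, and that the detail rows keep the rdt_ExpanderRow class from the question. The function names extract_expanded_text and scrape_details are made up for illustration. It selects on rdt_ExpanderRow alone because the sc-... class names look auto-generated and may change between builds.

```python
from bs4 import BeautifulSoup


def extract_expanded_text(html):
    """Return the text of every expanded detail row found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: detail rows carry the rdt_ExpanderRow class seen in the question.
    rows = soup.select("div.rdt_ExpanderRow")
    return [row.get_text(" ", strip=True) for row in rows]


def scrape_details(url, timeout=10):
    """Open `url` in Firefox, click every candidate button, parse the result."""
    # Imported locally so extract_expanded_text() is usable without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        wait = WebDriverWait(browser, timeout)
        # Wait until the JavaScript has rendered at least one candidate button.
        buttons = wait.until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, 'button[class|="sc"]')
            )
        )
        for button in buttons:
            button.click()
        # Wait for at least one detail row before reading the page source.
        wait.until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "div.rdt_ExpanderRow")
            )
        )
        return extract_expanded_text(browser.page_source)
    finally:
        # Always close the browser, even if a wait times out.
        browser.quit()
```

Calling scrape_details('https://projects.sfchronicle.com/2020/layoff-tracker/') would then return a list of strings, one per expanded row. If the page collapses a row when another one is opened, move the parsing step inside the loop and collect the text after each click instead.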
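To print the hidden data without driving a browser, you can use this example, which reads the JSON embedded in one of the site's JavaScript bundles: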
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://projects.sfchronicle.com/2020/layoff-tracker/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data_url = 'https://projects.sfchronicle.com' + soup.select_one('link[href*="commons-"]')['href']
data = re.findall(r'n\.exports=JSON\.parse\(\'(.*?)\'\)', requests.get(data_url).text)[1]
data = json.loads(data.replace(r"\'", "'"))
# uncomment this to see all data:
# print(json.dumps(data, indent=4))
for d in data[4:]:
    print('{:<50}{:<10}{:<30}{:<30}{:<30}{:<30}{:<30}'.format(*d.values()))
Prints:
Company                                           Layoffs   City                          County                        Month                         Industry                      Company description
Tesla (Temporary layoffs. Factory reopened)       11083     Fremont                       Alameda County                April                         Industrial                    Car maker
Bon Appetit Management Co.                        3015      San Francisco                 San Francisco County          April                         Food                          Food supplier
GSW Arena LLC-Chase Center                        1720      San Francisco                 San Francisco County          May                           Sports                        Arena vendors
YMCA of Silicon Valley                            1657      Santa Clara                   Santa Clara County            May                           Sports                        Gym
Nutanix Inc. (Temporary furlough of 2 weeks)      1434      San Jose                      Santa Clara County            April                         Tech                          Cloud computing
TeamSanJose                                       1304      San Jose                      Santa Clara County            April                         Travel                        Tourism bureau
San Francisco Giants                              1200      San Francisco                 San Francisco County          April                         Sports                        Stadium vendors
Lyft                                              982       San Francisco                 San Francisco County          April                         Tech                          Ride hailing
YMCA of San Francisco                             959       San Francisco                 San Francisco County          May                           Sports                        Gym
Hilton San Francisco Union Square                 923       San Francisco                 San Francisco County          April                         Travel                        Hotel
Six Flags Discovery Kingdom                       911       Vallejo                       Solano County                 June                          Entertainment                 Amusement park
San Francisco Marriott Marquis                    808       San Francisco                 San Francisco County          April                         Travel                        Hotel
Aramark                                           777       Oakland                       Alameda County                April                         Food                          Food supplier
The Palace Hotel                                  774       San Francisco                 San Francisco County          April                         Travel                        Hotel
Back of the House Inc                             743       San Francisco                 San Francisco County          April                         Food                          Restaurant
DPR Construction                                  715       Redwood City                  San Mateo County              April                         Real estate                   Construction
...and so on.
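To see what the regular expression and the replace() call in the example are doing, here is a small offline sketch that uses only the standard library. The bundle string, the company name "Joe's Diner", and the function name extract_payloads are invented for illustration, so the real bundle will look different and is much larger. The idea is that each JSON.parse('...') call in the minified JavaScript holds a JSON document inside a single-quoted JavaScript string, where apostrophes are escaped as \' and have to be unescaped before json.loads can read them.

```python
import json
import re

# Hypothetical stand-in for the minified JavaScript bundle.
bundle = (
    r"""1:function(n){n.exports=JSON.parse('{"ignored":true}')},"""
    r"""2:function(n){n.exports=JSON.parse('[{"Company":"Joe\'s Diner","Layoffs":12}]')}"""
)


def extract_payloads(js_text):
    """Decode every JSON document passed to n.exports=JSON.parse('...')."""
    # Lazy match: capture everything up to the first "')" after JSON.parse('.
    raw_payloads = re.findall(r"n\.exports=JSON\.parse\('(.*?)'\)", js_text)
    # \' is a JavaScript escape, not a JSON one, so unescape it before parsing.
    return [json.loads(raw.replace(r"\'", "'")) for raw in raw_payloads]


payloads = extract_payloads(bundle)
rows = payloads[1]  # same idea as the [1] index in the example above

for row in rows:
    # Left-align each value in a fixed-width column, as in the example above.
    print("{:<20}{:<10}".format(*row.values()))
```

Which match holds the table data (index 1 here and in the example) depends on the order of modules in the bundle, so it is worth printing the decoded payloads once to check before relying on a fixed index.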