How can we navigate to a web page, scrape data, move to the next page, and do it again?
I've made a couple of attempts at code that navigates to a web page, reads the data from a table into a data frame, then moves to the next page and does the same thing again. Below is some sample code I've been testing. Right now I'm stuck and not sure how to proceed.
# first attempt
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from time import sleep

lst = []
url = "https://www.nasdaq.com/market-activity/stocks/screener"
for numb in (1, 10):
    url = "https://www.nasdaq.com/market-activity/stocks/screener"
    r = requests.get(url)
    html = r.text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all('table')
    df = pd.DataFrame(table)
    lst.append(df)

    def get_cpf():
        driver = webdriver.Chrome("C:/Utility/chromedriver.exe")
        driver.get(url)
        driver.find_element_by_class('pagination__page" data-page="'' + numb + ''').click()
        sleep(10)
        text = driver.find_element_by_id('texto_cpf').text
        print(text)

    get_cpf()
    get_cpf.click
### second attempt
#import BeautifulSoup
from bs4 import BeautifulSoup
import pandas as pd
import requests
from selenium import webdriver
from time import sleep

lst = []
for numb in (1, 10):
    r = requests.get('https://www.nasdaq.com/market-activity/stocks/screener')
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    table = soup.find("table", {"class": "nasdaq-screener__table"})
    for row in table.findAll("tr"):
        for cell in row("td"):
            data = cell.get_text().strip()
    df = pd.DataFrame(data)
    lst.append(df)

    def get_cpf():
        driver = webdriver.Chrome("C:/Utility/chromedriver.exe")
        driver.get(url)
        driver.find_element_by_class('pagination__page" data-page="'' + numb + ''').click()
        sleep(10)
        text = driver.find_element_by_id('texto_cpf').text
        print(text)

    get_cpf()
    get_cpf.click
### third attempt
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import time
import requests
import pandas as pd

lst = []
url = "https://www.nasdaq.com/market-activity/stocks/screener"
driver = webdriver.Chrome("C:/Utility/chromedriver.exe")
wait = WebDriverWait(driver, 10)
driver.get(url)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#_evh-ric-c"))).click()
for pages in range(1, 9):
    try:
        print(pages)
        r = requests.get(url)
        html = r.text
        soup = BeautifulSoup(html, "html.parser")
        table = soup.find_all('table')
        df = pd.DataFrame(table)
        lst.append(df)
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.pagination__next"))).click()
        time.sleep(1)
    except:
        break
Here's a screenshot of the HTML behind the table I'm trying to scrape.
So, on the first page, I'd like to scrape everything from:
AAPL Apple Inc. Common Stock $127.79 6.53 5.385% 2,215,538,678,600
to:
ASML ASML Holding N.V. New York Registry Shares $583.55 16.46 2.903% 243,056,764,541
Then move to page 2 and do the same, move to page 3 and do the same, and so on and so on. I'm not sure whether this is doable with BeautifulSoup alone, or whether I'll need Selenium for the button-click events. I'm happy to do whichever is simplest here. Thanks!
Note that you don't need to use selenium for a task like this, since it will slow your process down. In real-world scenarios we only use selenium to bypass browser detection, and then we pass the cookies on to any HTTP module to carry on with.
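That cookie hand-off can be sketched roughly like this (a hypothetical helper I'm making up for illustration; `driver` stands for any already-started Selenium driver):

```python
import requests


def session_from_driver(driver):
    """Copy the cookies Selenium collected into a plain requests.Session,
    so subsequent requests can skip the browser entirely."""
    session = requests.Session()
    for cookie in driver.get_cookies():
        # Selenium returns cookies as dicts with 'name', 'value', 'domain', ...
        session.cookies.set(cookie['name'], cookie['value'],
                            domain=cookie.get('domain'))
    return session
```

After that, `session.get(...)` sends the same cookies the browser earned, without the Selenium overhead.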
As for your task: I noticed there's an API that is actually the source behind the HTML. Here's a quick call to it.
import pandas as pd
import requests


def main(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"
    }
    params = {
        'tableonly': 'true',
        'limit': 1000
    }
    r = requests.get(url, params=params, headers=headers)
    goal = pd.DataFrame(r.json()['data']['table']['rows'])
    print(goal)
    goal.to_csv('data.csv', index=False)


if __name__ == "__main__":
    main('https://api.nasdaq.com/api/screener/stocks')
Note that each page contains 25 tickers; in my code I pulled 1000, which is 1000 / 25 = 40 pages' worth. You don't need to loop over pages here, because you can just increase the limit instead! But if you do want to use a for loop, you would have to loop over the following URL and keep incrementing the offset:
https://api.nasdaq.com/api/screener/stocks?tableonly=true&limit=25&offset=0
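A hedged sketch of that offset loop, using the same endpoint and parameters as above (`page_offsets` and `fetch_pages` are names I've made up, and I haven't exercised the live endpoint here):

```python
import pandas as pd
import requests

BASE_URL = 'https://api.nasdaq.com/api/screener/stocks'
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"
}


def page_offsets(pages, page_size=25):
    """Each page holds 25 rows, so successive offsets are 0, 25, 50, ..."""
    return [page * page_size for page in range(pages)]


def fetch_pages(pages):
    """Request one 25-row page per offset and stitch them into one DataFrame."""
    frames = []
    for offset in page_offsets(pages):
        params = {'tableonly': 'true', 'limit': 25, 'offset': offset}
        r = requests.get(BASE_URL, params=params, headers=HEADERS)
        frames.append(pd.DataFrame(r.json()['data']['table']['rows']))
    return pd.concat(frames, ignore_index=True)
```

Calling `fetch_pages(3)`, for instance, should return the first three pages (75 rows) as a single DataFrame, assuming the endpoint behaves as described above.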
I won't deal with the API, since Nuran just wants to go the route that was asked about. Here's an example that works through the first 10 pages. First we remove the notification, then we wait for the next button to become interactable and click it.
wait = WebDriverWait(driver, 10)
driver.get("https://www.nasdaq.com/market-activity/stocks/screener")
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#_evh-ric-c"))).click()
# Currently you start on the 1st page; say we want to click 9 times to reach the 10th page
for pages in range(1, 10):
    try:
        print(pages)
        # Get your data from this page
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.pagination__next"))).click()
        # This is just here to slow everything down, so it may be removable.
        time.sleep(5)
    except:
        break
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Then you can do something like this:
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
div = soup.select_one("table.nasdaq-screener__table")
table = pd.read_html(str(div))
print(table[0])
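For instance, with a small static snippet standing in for `driver.page_source` (the class name matches the screener table above; if `pd.read_html` complains about a missing parser backend, the cells can also be pulled out with BeautifulSoup directly, as sketched here):

```python
import pandas as pd
from bs4 import BeautifulSoup

# A static stand-in for driver.page_source; the real page's table
# uses the same class name, per the question's screenshot.
HTML = """
<table class="nasdaq-screener__table">
  <tr><th>Symbol</th><th>Name</th><th>Price</th></tr>
  <tr><td>AAPL</td><td>Apple Inc. Common Stock</td><td>$127.79</td></tr>
  <tr><td>ASML</td><td>ASML Holding N.V.</td><td>$583.55</td></tr>
</table>
"""


def table_to_df(html):
    """Extract header and body cells from the screener table into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.nasdaq-screener__table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr") if tr.find_all("td")]
    return pd.DataFrame(rows, columns=headers)


df = table_to_df(HTML)
print(df)
```

Appending each page's `df` to a list and calling `pd.concat` on it afterwards would give the single combined frame the question asks for.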