Selenium: Web-Scraping Historical Data from Coincodex and transform into a Pandas Dataframe
I am really struggling to scrape some historical data from https://coincodex.com/crypto/bitcoin/historical-data/ with Selenium, and I keep failing somewhere along the way.
The task is: web-scrape all data from the start of the year to the end of September using Selenium and BeautifulSoup, and convert it into a Pandas df. My code so far is:
from selenium import webdriver
import time
import pandas as pd
from bs4 import BeautifulSoup

URL = "https://coincodex.com/crypto/bitcoin/historical-data/"

driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
driver.get(URL)
time.sleep(2)

# The page fetched by the driver is parsed with Beautiful Soup
HTMLPage = BeautifulSoup(driver.page_source, 'html.parser')

Table = HTMLPage.find('table', class_='styled-table full-size-table')
Rows = Table.find_all('tr', class_='ng-star-inserted')
len(Rows)

# Empty list to store the data
extracted_data = []

# Loop over each row of the table
for i in range(0, len(Rows)):
    try:
        # Empty dictionary to store the data of the current row
        RowDict = {}
        # Extract all the columns of the row
        Values = Rows[i].find_all('td')
        # The values (Open, High, Close etc.) are extracted and stored in the dictionary
        if len(Values) == 7:
            RowDict["Date"] = Values[0].text.replace(',', '')
            RowDict["Open"] = Values[1].text.replace(',', '')
            RowDict["High"] = Values[2].text.replace(',', '')
            RowDict["Low"] = Values[3].text.replace(',', '')
            RowDict["Close"] = Values[4].text.replace(',', '')
            RowDict["Volume"] = Values[5].text.replace(',', '')
            RowDict["Market Cap"] = Values[6].text.replace(',', '')
            extracted_data.append(RowDict)
    except:
        print("Row Number: " + str(i))

extracted_data = pd.DataFrame(extracted_data)
print(extracted_data)
Sorry, I'm new to Python and web scraping, and I hope someone can help me. Any help would be greatly appreciated.
Coincodex provides a query UI where you can adjust the time range. After setting the start and end dates to January 1 and September 30 and clicking the "Select" button, the site sends a GET request to the backend endpoint https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug/bitcoin/2021-1-1/2021-9-30/1?t=5459791. If you send a request to that URL, you get back all the data you need for this interval:
import requests, json
import pandas as pd
data = json.loads(requests.get('https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug/bitcoin/2021-1-1/2021-9-30/1?t=5459791').text)
df = pd.DataFrame(data['data'])
Output:
time_start time_end price_open_usd ... price_avg_ETH volume_ETH market_cap_ETH
0 2021-01-01 00:00:00 2021-01-02 00:00:00 28938.896888 ... 39.496780 8.728544e+07 7.341417e+08
1 2021-01-02 00:00:00 2021-01-03 00:00:00 29329.695772 ... 40.934106 9.351177e+07 7.608959e+08
2 2021-01-03 00:00:00 2021-01-04 00:00:00 32148.048500 ... 38.970510 1.448755e+08 7.244327e+08
3 2021-01-04 00:00:00 2021-01-05 00:00:00 32949.399464 ... 31.433580 1.292715e+08 5.843597e+08
4 2021-01-05 00:00:00 2021-01-06 00:00:00 32023.293433 ... 30.478852 1.186652e+08 5.666423e+08
.. ... ... ... ... ... ... ...
268 2021-09-26 00:00:00 2021-09-27 00:00:00 42670.363351 ... 14.438247 1.573066e+07 2.718238e+08
269 2021-09-27 00:00:00 2021-09-28 00:00:00 43204.962300 ... 14.157527 1.660821e+07 2.665518e+08
270 2021-09-28 00:00:00 2021-09-29 00:00:00 42111.843283 ... 14.439326 1.782125e+07 2.718712e+08
271 2021-09-29 00:00:00 2021-09-30 00:00:00 41004.598500 ... 14.510256 1.748895e+07 2.732201e+08
272 2021-09-30 00:00:00 2021-10-01 00:00:00 41536.594100 ... 14.454206 1.810257e+07 2.721773e+08
[273 rows x 23 columns]
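If you only need the asker's columns, the API frame can be trimmed and its timestamps parsed into a proper datetime index. A minimal sketch, with two hard-coded rows standing in for the API response above (only time_start, time_end, and price_open_usd are confirmed by the printed output; the rest of the response schema is not shown here):

```python
import pandas as pd

# Hard-coded stand-ins for two rows of the API response shown above
rows = [
    {"time_start": "2021-01-01 00:00:00", "time_end": "2021-01-02 00:00:00",
     "price_open_usd": 28938.896888},
    {"time_start": "2021-01-02 00:00:00", "time_end": "2021-01-03 00:00:00",
     "price_open_usd": 29329.695772},
]

df = pd.DataFrame(rows)
# Parse the timestamp strings into datetimes and index the frame by day
df["time_start"] = pd.to_datetime(df["time_start"])
df = df.set_index("time_start")
print(df["price_open_usd"])
```

With a datetime index in place, slicing a sub-range (e.g. `df.loc["2021-01"]`) works directly.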
To extract the Bitcoin (BTC) historical data from all seven columns of the Coincodex website and print them to a text file, you need to induce WebDriverWait for visibility_of_all_elements_located(); then, using list comprehensions, you can build the lists, create a DataFrame, and finally export the values (excluding the index) to a text file, using the following locator strategies:
Code block:
driver.get("https://coincodex.com/crypto/bitcoin/historical-data/")
headers = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table th")))]
dates = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(1)")))]
opens = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(2)")))]
highs = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(3)")))]
lows = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(4)")))]
closes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(5)")))]
volumes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(6)")))]
marketcaps = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(7)")))]
my_list = [[headers], [dates], [opens], [highs], [lows], [closes], [volumes]]
df = pd.DataFrame(my_list)
print(df)
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
Console output:
0 [Date, Open, High, Low, Close, Volume, Market ...
1 [Oct 27, 2021, Oct 28, 2021, Oct 29, 2021, Oct...
2 [$ 60,332, $ 58,438, $ 60,600, $ 62,225, $ 61,...
3 [$ 61,445, $ 61,940, $ 62,945, $ 62,225, $ 62,...
4 [$ 58,300, $ 58,240, $ 60,341, $ 60,860, $ 60,...
5 [$ 58,681, $ 60,439, $ 62,220, $ 61,661, $ 61,...
6 [$ 84.44B, $ 99.67B, $ 86.79B, $ 82.73B, $ 74....
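In the output above each scraped list becomes a single row because the lists are nested one level deep. If a conventional column-wise frame is preferred, the header texts can be zipped with the column lists. A minimal sketch, with hard-coded sample values standing in for the scraped lists (the figures below are illustrative, not real quotes):

```python
import pandas as pd

# Hard-coded sample values standing in for the lists scraped above
headers    = ["Date", "Open", "High", "Low", "Close", "Volume", "Market Cap"]
dates      = ["Oct 27, 2021", "Oct 28, 2021"]
opens      = ["$ 60,332", "$ 58,438"]
highs      = ["$ 61,445", "$ 61,940"]
lows       = ["$ 58,300", "$ 58,240"]
closes     = ["$ 58,681", "$ 60,439"]
volumes    = ["$ 84.44B", "$ 99.67B"]
marketcaps = ["$ 1.10T", "$ 1.14T"]

# Pair each header with its column list, then build the frame column-wise
columns = dict(zip(headers, [dates, opens, highs, lows, closes, volumes, marketcaps]))
df = pd.DataFrame(columns)
print(df)
```

This yields one row per date with the seven named columns, which is also the shape the question's `extracted_data` frame aims for.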