
Selenium: Web-Scraping Historical Data from Coincodex and transform into a Pandas Dataframe

I am struggling to scrape historical data spanning several pages from https://coincodex.com/crypto/bitcoin/historical-data/ using Selenium. I am stuck on the following steps:

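  1. Getting the data from the subsequent pages (not only September, which is page 1)
  2. Removing the '$' from each value
  3. Converting values with B (billion) into full numbers (1B into 1000000000)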

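The predefined task is: web-scrape all the data from the beginning of the year until the end of September with Selenium and BeautifulSoup and transform it into a Pandas df. My code so far is: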

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time

URL = "https://coincodex.com/crypto/bitcoin/historical-data/"

# executable_path is the Selenium 3 style; newer Selenium 4 releases expect a Service object instead
driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
driver.get(URL)
time.sleep(2)  # crude wait for the table to render

# Web page fetched from driver is parsed using Beautiful Soup.
HTMLPage = BeautifulSoup(driver.page_source, 'html.parser')

Table = HTMLPage.find('table', class_='styled-table full-size-table')
Rows = Table.find_all('tr', class_='ng-star-inserted')

# Column names in the order the cells appear in each row
Columns = ["Date", "Open", "High", "Low", "Close", "Volume", "Market Cap"]

# Empty list is created to store the data
extracted_data = []
# Loop to go through each row of table
for i, Row in enumerate(Rows):
    try:
        # Extracted all the columns of a row and stored in a variable
        Values = Row.find_all('td')
        # Values (Open, High, Close etc.) are extracted and stored in dictionary
        if len(Values) == 7:
            RowDict = {name: cell.text.replace(',', '') for name, cell in zip(Columns, Values)}
            extracted_data.append(RowDict)
    except Exception:
        # Report the row that could not be parsed and carry on
        print("Row Number: " + str(i))

extracted_data = pd.DataFrame(extracted_data)
print(extracted_data)

Sorry, I am new to Python and web scraping, and I hope someone can help me. It would be much appreciated.

Coincodex provides a query UI in which you can adjust the time range. After setting the start and end dates to January 1 and September 30 and clicking the "Select" button, the site sends a GET request to the backend using the https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug/bitcoin/2021-1-1/2021-9-30/1?t=5459791 endpoint. If you send a request to this URL yourself, you get back all the data you need for that interval:

import requests, json
import pandas as pd
data = json.loads(requests.get('https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug/bitcoin/2021-1-1/2021-9-30/1?t=5459791').text)
df = pd.DataFrame(data['data'])

Output:

              time_start             time_end  price_open_usd  ...  price_avg_ETH    volume_ETH  market_cap_ETH
0    2021-01-01 00:00:00  2021-01-02 00:00:00    28938.896888  ...      39.496780  8.728544e+07    7.341417e+08
1    2021-01-02 00:00:00  2021-01-03 00:00:00    29329.695772  ...      40.934106  9.351177e+07    7.608959e+08
2    2021-01-03 00:00:00  2021-01-04 00:00:00    32148.048500  ...      38.970510  1.448755e+08    7.244327e+08
3    2021-01-04 00:00:00  2021-01-05 00:00:00    32949.399464  ...      31.433580  1.292715e+08    5.843597e+08
4    2021-01-05 00:00:00  2021-01-06 00:00:00    32023.293433  ...      30.478852  1.186652e+08    5.666423e+08
..                   ...                  ...             ...  ...            ...           ...             ...
268  2021-09-26 00:00:00  2021-09-27 00:00:00    42670.363351  ...      14.438247  1.573066e+07    2.718238e+08
269  2021-09-27 00:00:00  2021-09-28 00:00:00    43204.962300  ...      14.157527  1.660821e+07    2.665518e+08
270  2021-09-28 00:00:00  2021-09-29 00:00:00    42111.843283  ...      14.439326  1.782125e+07    2.718712e+08
271  2021-09-29 00:00:00  2021-09-30 00:00:00    41004.598500  ...      14.510256  1.748895e+07    2.732201e+08
272  2021-09-30 00:00:00  2021-10-01 00:00:00    41536.594100  ...      14.454206  1.810257e+07    2.721773e+08

[273 rows x 23 columns]
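
If you need a different coin or date range, the URL above can be assembled from its parts. The following is only a sketch based on the single URL shown in this answer: the function name build_history_url and the parameter names are hypothetical, and the meaning of the trailing path segment (1) and of the t query parameter is not explained in this thread, so both are simply passed through as they appear. Whether the endpoint accepts requests without t is an assumption you would have to check.

```python
from datetime import date
from typing import Optional

# Base path copied from the URL quoted in the answer above
BASE = "https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug"


def build_history_url(slug: str, start: date, end: date,
                      trailing: int = 1, t: Optional[int] = None) -> str:
    """Assemble a historical-data URL in the same shape as the one above.

    `trailing` and `t` are hypothetical names for parts of the URL whose
    meaning is not documented here; they are reproduced verbatim.
    """
    def fmt(d: date) -> str:
        # The observed URL uses non-zero-padded months and days (2021-1-1)
        return f"{d.year}-{d.month}-{d.day}"

    url = f"{BASE}/{slug}/{fmt(start)}/{fmt(end)}/{trailing}"
    if t is not None:
        url += f"?t={t}"
    return url


url = build_history_url("bitcoin", date(2021, 1, 1), date(2021, 9, 30), t=5459791)
print(url)
```

The resulting string can be passed to requests.get() exactly as in the snippet above.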

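To extract the Bitcoin (BTC) historical data from all seven columns of the Coincodex website and print them to a text file, you need to induce WebDriverWait for visibility_of_all_elements_located(). Then, using list comprehensions, you can create the lists, build a DataFrame, and finally export the values to a text file without the index, using the following locator strategies:

Code Block: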

driver.get("https://coincodex.com/crypto/bitcoin/historical-data/")
headers = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table th")))]
dates = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(1)")))]
opens = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(2)")))]
highs = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(3)")))]
lows = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(4)")))]
closes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(5)")))]
volumes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(6)")))]
marketcaps = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(7)")))]
# Note: marketcaps is collected above but is not included in my_list
my_list = [[headers], [dates], [opens], [highs], [lows], [closes], [volumes]]
df = pd.DataFrame(my_list)
print(df)

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
    

Console Output:

0  [Date, Open, High, Low, Close, Volume, Market ...
1  [Oct 27, 2021, Oct 28, 2021, Oct 29, 2021, Oct...
2  [$ 60,332, $ 58,438, $ 60,600, $ 62,225, $ 61,...
3  [$ 61,445, $ 61,940, $ 62,945, $ 62,225, $ 62,...
4  [$ 58,300, $ 58,240, $ 60,341, $ 60,860, $ 60,...
5  [$ 58,681, $ 60,439, $ 62,220, $ 61,661, $ 61,...
6  [$ 84.44B, $ 99.67B, $ 86.79B, $ 82.73B, $ 74....
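
The frame above holds one list per row and the values are still strings such as "$ 84.44B". As a possible follow-up, here is a minimal sketch that arranges the scraped lists as columns and converts the amounts to plain numbers, which covers the '$' and B (billion) steps from the question. The helper parse_amount and the SUFFIXES table are hypothetical names, the K/M/T suffixes are an assumption (only B appears in the output above), and the sample lists are just the first two values from the console output standing in for the lists returned by Selenium. Market cap is left out because its values are not visible in the output; it could be added the same way.

```python
from decimal import Decimal

import pandas as pd

# Assumed suffix table; only "B" is confirmed by the console output above
SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_amount(text: str) -> float:
    """Turn a string such as '$ 84.44B' or '$ 60,332' into a float."""
    cleaned = text.replace("$", "").replace(",", "").replace("\u202f", "").strip()
    multiplier = 1
    if cleaned[-1:].upper() in SUFFIXES:
        multiplier = SUFFIXES[cleaned[-1:].upper()]
        cleaned = cleaned[:-1].strip()
    # Decimal avoids binary rounding noise when scaling values like 84.44
    return float(Decimal(cleaned) * multiplier)


# Stand-ins for the lists built by the Selenium code above
dates = ["Oct 27, 2021", "Oct 28, 2021"]
opens = ["$ 60,332", "$ 58,438"]
highs = ["$ 61,445", "$ 61,940"]
lows = ["$ 58,300", "$ 58,240"]
closes = ["$ 58,681", "$ 60,439"]
volumes = ["$ 84.44B", "$ 99.67B"]

# One column per scraped list instead of one row per list
df = pd.DataFrame({"Date": dates, "Open": opens, "High": highs,
                   "Low": lows, "Close": closes, "Volume": volumes})

for column in ["Open", "High", "Low", "Close", "Volume"]:
    df[column] = df[column].map(parse_amount)
df["Date"] = pd.to_datetime(df["Date"], format="%b %d, %Y")

print(df)
```

With the frame in this shape, df.to_csv("bitcoin.txt", index=False) would write it to a text file without the index; the file name is only an example.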
