
How to scrape an updating HTML table using Selenium?

I am looking to scrape the coin table from the link and create a CSV file datewise. For every new coin update, a new entry should be created at the top of the existing data file.

Desired output:

Coin,Pings,...Datetime

BTC,25,...07:17:05 03/18/21

I haven't gotten far, but below is my attempt:

from selenium import webdriver
import numpy as np
import pandas as pd

firefox = webdriver.Firefox(executable_path="/usr/local/bin/geckodriver")
firefox.get('https://agile-cliffs-23967.herokuapp.com/binance/')

rows = len(firefox.find_elements_by_xpath("/html/body/div/section[2]/div/div/div/div/table/tr"))
columns = len(firefox.find_elements_by_xpath("/html/body/div/section[2]/div/div/div/div/table/tr[1]/th"))

df = pd.DataFrame(columns=['Coin','Pings','Net Vol BTC','Net Vol per','Recent Total Vol BTC', 'Recent Vol per', 'Recent Net Vol', 'Datetime'])

for r in range(1, rows+1):
    for c in range(1, columns+1): 
        value = firefox.find_element_by_xpath("/html/body/div/section[2]/div/div/div/div/table/tr["+str(r)+"]/th["+str(c)+"]").text
        print(value)
        
#         df.loc[i, ['Coin']] = 
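As an aside, the `find_element_by_xpath` / `find_elements_by_xpath` helpers used above were removed in Selenium 4; cells are now located with `find_element(By.XPATH, ...)`. A minimal sketch of the replacement call, assuming the same table XPath from the question and a live `firefox` driver:

```python
# Selenium 4 style: locators go through the By class instead of the old
# find_element_by_* helpers (removed in Selenium 4).
# from selenium.webdriver.common.by import By

TABLE_XPATH = "/html/body/div/section[2]/div/div/div/div/table"

def cell_xpath(row: int, col: int) -> str:
    """Build the 1-indexed XPath for a single table cell, as in the question."""
    return f"{TABLE_XPATH}/tr[{row}]/th[{col}]"

# With a live driver (assumed variable `firefox`) one cell's text would be read as:
#   value = firefox.find_element(By.XPATH, cell_xpath(1, 1)).text
print(cell_xpath(2, 3))  # /html/body/div/section[2]/div/div/div/div/table/tr[2]/th[3]
```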

Since the data is loaded dynamically, you can retrieve it directly from the source; no Selenium needed. The endpoint returns JSON whose rows are |-delimited strings that need to be split before they can be appended to the DataFrame. Since the site updates once per minute, you can wrap everything in a while True loop that runs the code every 60 seconds:

import requests
import time
import json
import pandas as pd

headers = ['Coin','Pings','Net Vol BTC','Net Vol %','Recent Total Vol BTC', 'Recent Vol %', 'Recent Net Vol', 'Datetime (UTC)']
df = pd.DataFrame(columns=headers)

s = requests.Session()
starttime = time.time()

while True:
    response = s.get('https://agile-cliffs-23967.herokuapp.com/ok', headers={'Connection': 'keep-alive'})
    d = json.loads(response.text)
    rows = [str(i).split('|') for i in d['resu'][:-1]]
    if rows:
        data = [dict(zip(headers, l)) for l in rows]
        df = df.append(data, ignore_index=True)
        df.to_csv('filename.csv', index=False)
    time.sleep(60.0 - ((time.time() - starttime) % 60.0))
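To illustrate the split-and-zip step above, here is a minimal sketch using a made-up row string in the payload's |-delimited shape (the real values come from the `/ok` endpoint):

```python
headers = ['Coin', 'Pings', 'Net Vol BTC', 'Net Vol %', 'Recent Total Vol BTC',
           'Recent Vol %', 'Recent Net Vol', 'Datetime (UTC)']

# Hypothetical '|'-delimited row in the shape the endpoint returns;
# the numbers here are invented for illustration only.
sample = 'BTC|25|2.45|3.31|15.10|0.90|1.20|07:17:05 03/18/21'

# Split the row and pair each value with its header to get one record.
record = dict(zip(headers, sample.split('|')))
print(record['Coin'])            # BTC
print(record['Datetime (UTC)'])  # 07:17:05 03/18/21
```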

You can append row data to a DataFrame by putting it into a dictionary:

# We reuse the headers when building dicts below
headers = ['Coin','Pings','Net Vol BTC','Net Vol per','Recent Total Vol BTC', 'Recent Vol per', 'Recent Net Vol', 'Datetime']
df = pd.DataFrame(columns=headers)

for r in range(1, rows+1):
    data = [firefox.find_element_by_xpath("/html/body/div/section[2]/div/div/div/div/table/tr["+str(r)+"]/th["+str(c)+"]").text \
                for c in range(1, columns+1)]
    row_dict = dict(zip(headers, data))
    df = df.append(row_dict, ignore_index=True)
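Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. On current pandas the same row-dict pattern works with `pd.concat`; a sketch with a hand-made row standing in for the scraped cell texts:

```python
import pandas as pd

headers = ['Coin', 'Pings', 'Net Vol BTC', 'Net Vol per', 'Recent Total Vol BTC',
           'Recent Vol per', 'Recent Net Vol', 'Datetime']
df = pd.DataFrame(columns=headers)

# Hand-made row values standing in for the Selenium cell texts.
data = ['BTC', '25', '1.0', '2.0', '3.0', '4.0', '5.0', '07:17:05 03/18/21']
row_dict = dict(zip(headers, data))

# pd.concat with a one-row DataFrame replaces df.append(row_dict, ignore_index=True)
df = pd.concat([df, pd.DataFrame([row_dict])], ignore_index=True)
print(df.shape)  # (1, 8)
```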
