Why can't I save my scraped HTML table to a pandas DataFrame?

Question

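I have a Python script that scrapes an HTML table. When I try to save the scraped data to a pandas DataFrame, I get an error. Could you please check what I am doing wrong?

Here is my code: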

from selenium import webdriver 
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time
import pandas as pd

def HDI():
    url = 'https://worldpopulationreview.com/country-rankings/hdi-by-country'

    service = Service(executable_path="C:/driver/new/chromedriver_win32/chromedriver.exe")
    driver = webdriver.Chrome(service=service)
    driver.get(url)
    time.sleep(5)

    btn = driver.find_element(By.CLASS_NAME, '_3p_1XEZR')
    btn.click()
    time.sleep(5)

    temp_height=0

    while True:
        #Looping down the scroll bar
        driver.execute_script("window.scrollBy(0,500)")
        #sleep and let the scroll bar react
        time.sleep(5)
        #Get the distance of the current scroll bar from the top
        check_height = driver.execute_script("return document.documentElement.scrollTop || window.window.pageYOffset || document.body.scrollTop;")
        #If the two are equal to the end
        if check_height==temp_height:
           break
        temp_height=check_height
    time.sleep(3)

    row_headers = []
    tableheads = driver.find_elements(By.CLASS_NAME, 'datatable-th')
    for value in tableheads:
        thead_values = value.find_element(By.CLASS_NAME, 'has-tooltip-bottom').text.strip()

        row_headers.append(thead_values)

    tablebodies = driver.find_elements(By.TAG_NAME, 'tr')
    for row in tablebodies:
        tabledata = row.find_elements(By.CSS_SELECTOR, 'tr, td')

        row_data = []
        for data in tabledata:
            row_data.append(data.text)

    df = pd.DataFrame(row_data, columns=row_headers)
    df 

HDI()

This is the error I get:

File "c:\Users\LP\Documents\python\HD1 2023\HDI2023.py", line 49, in HDI
df = pd.DataFrame(row_data, columns=row_headers)
File "C:\Users\LP\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\internals\construction.py", line 351, in ndarray_to_mgr
_check_values_indices_shape_match(values, index, columns)
File "C:\Users\LP\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\internals\construction.py", line 422, in _check_values_indices_shape_match
  raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (9, 1), indices imply (9, 9)

I want to save the scraped values above to a pandas DataFrame. That is my goal. Please help if you can. Thanks.

Answer 1

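Your variable row_data holds only one row, and it is overwritten on every iteration. After the loop it is a flat list of 9 strings, which pandas treats as 9 rows of 1 column, while row_headers asks for 9 columns; that is why the error reports (9, 1) against (9, 9). You probably want to use all rows in the DataFrame. For example, you can create a new variable row_data_all and pass that to the DataFrame: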

row_data_all = []  # one inner list per table row
for row in tablebodies:
    # By is already imported in the question's script
    tabledata = row.find_elements(By.CSS_SELECTOR, 'tr, td')

    row_data = []
    for data in tabledata:
        row_data.append(data.text)
    # keep this row before row_data is reset on the next iteration
    row_data_all.append(row_data)

df = pd.DataFrame(row_data_all, columns=row_headers)
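
The difference between a flat list and a list of rows can be reproduced without Selenium. The following is a minimal sketch, not the original scraper: the column names and values are made-up stand-ins, and the empty first row is an assumption about what a header row without td cells would produce.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped values (not real HDI data).
row_headers = ["Country", "HDI", "Year"]
scraped_rows = [
    [],                            # assumed: a header <tr> with no <td> cells
    ["Aland", "0.90", "2021"],
    ["Borduria", "0.75", "2021"],
]

# One flat row combined with several column names reproduces the error:
# pandas reads the list as 3 rows of 1 column, not 1 row of 3 columns.
try:
    pd.DataFrame(scraped_rows[1], columns=row_headers)
except ValueError as exc:
    print("flat list fails:", exc)

# Collect every non-empty row first, then build the frame once.
row_data_all = [row for row in scraped_rows if row]
df = pd.DataFrame(row_data_all, columns=row_headers)
print(df.shape)
```

Filtering out empty rows is optional; it only matters if some tr elements contain no td cells.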

If you really want to create a DataFrame from a single row, use:

pd.DataFrame(row_data, index = row_headers).T
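
As a small illustration of the single-row case, the sketch below compares the transpose form with wrapping the row in an outer list. The headers and values are hypothetical placeholders, and the wrapped-list variant is an alternative that the answer does not mention.

```python
import pandas as pd

row_headers = ["Country", "HDI", "Year"]   # hypothetical column names
row_data = ["Aland", "0.90", "2021"]       # hypothetical single scraped row

# Form used in the answer: headers become the index, then transpose.
df_transposed = pd.DataFrame(row_data, index=row_headers).T

# Alternative: wrap the row in a list so pandas sees one row of three cells.
df_wrapped = pd.DataFrame([row_data], columns=row_headers)

print(df_transposed.shape, df_wrapped.shape)
```

Both forms give one row with the headers as column names.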

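Alternative

You can also use pandas' read_html() function, which only needs the HTML source. You can even pass it the source of the whole page, and it returns a list of DataFrames, one for each table found in the source. This will also speed up your function considerably.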

html_table = driver.find_element(By.TAG_NAME, "table").get_attribute("outerHTML")
df = pd.read_html(html_table)[0]
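
Recent pandas releases (2.1 and later) warn when a literal HTML string is passed to read_html(), so wrapping the string in io.StringIO is a safer habit. The sketch below assumes an HTML parser such as lxml is installed; the helper name first_table and the sample markup are made up for illustration.

```python
from io import StringIO

import pandas as pd


def first_table(html: str) -> pd.DataFrame:
    """Return the first <table> found in an HTML string as a DataFrame."""
    # StringIO makes the string look like a file, which read_html accepts.
    return pd.read_html(StringIO(html))[0]


# Hypothetical markup standing in for the scraped table.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Country</th><th>HDI</th></tr></thead>
  <tbody>
    <tr><td>Aland</td><td>0.90</td></tr>
    <tr><td>Borduria</td><td>0.75</td></tr>
  </tbody>
</table>
"""
```

With the Selenium session from the question, the helper could be called as first_table(driver.page_source), or with the outerHTML string obtained above.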

Solution 1
1 Accepted 2023-01-25 11:49:52