
How can I find the right XPath and loop over a table?

I would like to get all the values from the table "Elektriciteit NL" on https://powerhouse.net/forecast-prijzen-onbalans/. However, after many attempts to find the right XPath with Selenium, I have not been able to scrape the table.

I used "Inspect" to copy the table's XPath so that I could determine the length of the table for scraping later. After this failed I tried "contains", but this was not successful either. Afterwards I tried a few things with BeautifulSoup, also without any luck.

#%%
import pandas as pd

from selenium import webdriver
#%% powerhouse Elektriciteit NL base & peak

url = "https://powerhouse.net/forecast-prijzen-onbalans/"

#%% open webpagina
driver = webdriver.Chrome(executable_path = path + 'chromedriver.exe')
driver.get(url)

#%%
prices = []


#loop for values in table
for j in range(len(driver.find_elements_by_xpath('//tr[@id="endex_nl_forecast"]/div[3]/table/tbody/tr[1]/td[4]'))):
    base = driver.find_elements_by_xpath('//tr[@id="endex_nl_forecast"]/div[3]/table/tbody/tr[1]/td[4]')[j]


#%%
#trying with BeautifulSoup
from bs4 import BeautifulSoup
import requests 


response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")

table  = soup.find('table', id = 'endex_nl_forecast')
rows = soup.find_all('tr')

I would like to have the table in a DataFrame and understand exactly how XPath works. I'm fairly new to the whole concept.

If you are open to approaches other than XPath, you can do this without Selenium or XPath.

You could just use pandas:

import pandas as pd

table = pd.read_html('https://powerhouse.net/forecast-prijzen-onbalans/')[4]

If you want a text representation of the icons, you could extract the class name of the svg, which describes the arrow direction, from the appropriate td elements.

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://powerhouse.net/forecast-prijzen-onbalans/')
soup = bs(r.content, 'lxml')
table = soup.select_one('#endex_nl_forecast table')
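rows = []
# Column names come from the header cells
headers = [i.text for i in table.select('th')]

# Skip the header row; for cells holding an svg icon, use the last part of its third class name
for tr in table.select('tr')[1:]:
    rows.append([i.text if i.svg is None else i.svg['class'][2].split('-')[-1] for i in tr.select('td')])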

df = pd.DataFrame(rows, columns = headers)
print(df)

Sample rows: (screenshot not preserved)
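To make the `i.svg['class'][2].split('-')[-1]` expression less opaque, here is a minimal sketch on plain Python lists. BeautifulSoup returns the `class` attribute as a list of strings, so the expression picks the third class name and keeps the text after its last dash. The class names below are hypothetical, invented for illustration, and the page's real names and their order may differ. `arrow_direction` and `arrow_direction_by_prefix` are likewise illustrative helpers, not part of any library.

```python
def arrow_direction(svg_classes):
    """Mirror the answer's expression: third class name, text after the last dash."""
    return svg_classes[2].split("-")[-1]


def arrow_direction_by_prefix(svg_classes, prefix="fa-arrow-"):
    """Alternative that does not depend on class order.

    Assumes (hypothetically) that the direction is encoded in a class
    starting with `prefix`; returns None when no such class exists.
    """
    for name in svg_classes:
        if name.startswith(prefix):
            return name[len(prefix):]
    return None


# Hypothetical class lists; the real page may use other names or another order.
up_icon = ["svg-inline--fa", "fa-w-14", "fa-arrow-up"]
down_icon = ["fa-arrow-down", "svg-inline--fa", "fa-w-14"]

print(arrow_direction(up_icon))              # up
print(arrow_direction_by_prefix(down_icon))  # down
```

The index-based version is shorter but depends on the class order: for the second list it would return `14` instead of a direction, so it is worth checking the actual class attribute in the page source before relying on index 2.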

You can use the Selenium driver to locate the table and its contents:

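import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver can be found on PATH
url = 'https://powerhouse.net/forecast-prijzen-onbalans/'
driver.get(url)

time.sleep(3)  # crude wait for the page to finish rendering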

To read and print the table headers:

tableHeader = driver.find_elements_by_xpath("//*[@id='endex_nl_forecast']//thead//th")
print(tableHeader)
for header in tableHeader:
    print(header.text)

To find the number of rows in the table:

rowElements = driver.find_elements_by_xpath("//*[@id='endex_nl_forecast']//tbody/tr")
print('Total rows in the table:', len(rowElements))

To print each row as is:

for row in rowElements:
    print(row.text)
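Since the question also asks how XPath works, the sketch below runs the same kind of expressions against a small inline document, so it needs no browser. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires paths relative to an element (hence the leading `.`). The HTML is a made-up stand-in, not the real page: it assumes the `endex_nl_forecast` id sits on a wrapper element around the table, as the selectors in the answers suggest, and the product names and prices are invented.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the page; all values are invented.
HTML = """
<html><body>
  <div id="endex_nl_forecast">
    <h3>Elektriciteit NL</h3>
    <table>
      <thead><tr><th>Product</th><th>Base</th><th>Peak</th></tr></thead>
      <tbody>
        <tr><td>Cal-A</td><td>10.0</td><td>12.5</td></tr>
        <tr><td>Cal-B</td><td>11.0</td><td>13.5</td></tr>
      </tbody>
    </table>
  </div>
</body></html>
""".strip()

root = ET.fromstring(HTML)
BASE = ".//*[@id='endex_nl_forecast']"  # any element carrying that id

# tr[@id=...] asks for a <tr> that itself has the id; here the id is on
# the wrapper, so nothing matches and a loop over the result never runs.
wrong = root.findall(".//tr[@id='endex_nl_forecast']")

# '//' means "any depth below", '/' means "direct child".
headers = [th.text for th in root.findall(BASE + "//thead//th")]
rows = root.findall(BASE + "//tbody/tr")

# Positions are 1-based: first body row, second cell.
first_base = root.find(BASE + "//tbody/tr[1]/td[2]").text

# One dict per row, keyed by header text.
records = [dict(zip(headers, (td.text for td in tr.findall("td")))) for tr in rows]

print(len(wrong), len(rows))  # 0 2
print(first_base)             # 10.0
print(records[1])
```

A list of dicts like `records` can be passed directly to `pd.DataFrame(records)`. With Selenium 4, where the `find_elements_by_xpath` helpers were removed, the same expressions (without the leading `.`) would be passed as `driver.find_elements(By.XPATH, expr)` after `from selenium.webdriver.common.by import By`.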
