
How to pull actual data from multiple pages of a website using Selenium, Beautiful Soup, and Pandas?

I am new to extracting data with Python. I want to create an Excel file from tables pulled from a website.

The website URL: "https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml"

On this page the hourly data is shown in a table that is split across separate pages, because one hour contains roughly 500 rows.
I want to pull all the data for every hour, but my problem is that I keep pulling the same table even though the page changes.
I am using the Beautiful Soup, pandas and selenium libraries. Here is my code, with comments explaining what I do.

import requests
r = requests.get('https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml')
from bs4 import BeautifulSoup
source = BeautifulSoup(r.content,"lxml")
metin =source.title.get_text()
source.find("input",attrs={"id":"j_idt206:txt1"})
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd 
tarih = source.find("input",attrs={"id":"j_idt206:date1_input"})["value"] 
import datetime
import time
x = datetime.datetime.now()
today = datetime.date.today()
# print(today)
tomorrow = today + datetime.timedelta(days = 1) 
tomorrow = str(tomorrow)
words = tarih.split('.')  
yeni_tarih = '.'.join(reversed(words))
yeni_tarih =yeni_tarih.replace(".","-")
def tablo_cek():
    tablo = source.find_all("table")  # the table on the page
    dfs = pd.read_html(str(tablo))  # read the table into a DataFrame
    dfs.append(dfs)  # append the newly pulled table
    print(dfs)
    return tablo 
if tomorrow == yeni_tarih :
    print(yeni_tarih == tomorrow)
    driver = webdriver.Chrome("C:/Users/tugba.ozkan/AppData/Local/SeleniumBasic/chromedriver.exe")
    driver.get("https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml")
    time.sleep(1)
    driver.find_element_by_xpath("//select/option[@value='96']").click()
    time.sleep(1)
    user = driver.find_element_by_name("j_idt206:txt1")
    nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']")
    num=0
    while num < 24 :
        user.send_keys(num)  # send the hour to the input
        driver.find_element_by_id('j_idt206:goster').click()  # apply the hour
        nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']")  # next page for that hour
        nextpage.click()  # move to the next page
        user = driver.find_element_by_name("j_idt206:txt1")  # locate the hour input again
        time.sleep(1)
        tablo_cek()
        num = num + 1  # increment the hour
        user.clear()  # reset the hour input
else:
    print("Güncelleme gelmedi")

In this part of the code:

nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']")  # next page for that hour
nextpage.click()

When Python clicks the button to go to the next page, the next page is displayed and it should then pull the next table, as shown in the picture. But it doesn't work: in the output I see that the appended tables all have the same values. This is my output:

[    Fiyat (TL/MWh)  Talep (MWh)  Arz (MWh)
0                0      25.0101   19.15990
1                1      24.9741   19.16390
2                2      24.9741   19.18510
3               85      24.9741   19.18512
4               86      24.9736   19.20762
5               99      24.9736   19.20763
6              100      24.6197   19.20763
7              101      24.5697   19.20763
8              300      24.5697   19.20768
9              301      24.5697   19.20768
10             363      24.5697   19.20770
11             364      24.5497   19.20770
12             400      24.5497   19.20771
13             401      24.5297   19.20771
14             498      24.5297   19.20773
15             499      24.5297   19.36473
16             500      24.5297   19.36473
17             501      24.4097   19.36473
18             563      24.4097   19.36475
19             564      24.3897   19.36475
20             999      24.3897   19.36487
21            1000      24.3097   19.36487
22            1001      24.1897   19.36487
23            1449      24.1897   19.36499, [...]]
(... the exact same 24-row table is printed again, unchanged, for every following hour ...)

That is because you pull the initial html here, source = BeautifulSoup(r.content,"lxml"), and then keep parsing that same content.

You need to pull the html for every page you go to. It only takes adding one line; I commented where I added it:

import requests
r = requests.get('https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml')
from bs4 import BeautifulSoup
source = BeautifulSoup(r.content,"lxml")
metin =source.title.get_text()
source.find("input",attrs={"id":"j_idt206:txt1"})
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd 
tarih = source.find("input",attrs={"id":"j_idt206:date1_input"})["value"] 
import datetime
import time
x = datetime.datetime.now()
today = datetime.date.today()
# print(today)
tomorrow = today + datetime.timedelta(days = 1) 
tomorrow = str(tomorrow)
words = tarih.split('.')  
yeni_tarih = '.'.join(reversed(words))
yeni_tarih =yeni_tarih.replace(".","-")
def tablo_cek():
    source = BeautifulSoup(driver.page_source,"lxml")  #<-- get the current html
    tablo = source.find_all("table")  # the table on the page
    dfs = pd.read_html(str(tablo))  # read the table into a DataFrame
    dfs.append(dfs)  # append the newly pulled table
    print(dfs)
    return tablo 
if tomorrow == yeni_tarih :
    print(yeni_tarih == tomorrow)
    driver = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")
    driver.get("https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml")
    time.sleep(1)
    driver.find_element_by_xpath("//select/option[@value='96']").click()
    time.sleep(1)
    user = driver.find_element_by_name("j_idt206:txt1")
    nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']")
    num=0
    
    tablo_cek() #<-- need to get that data before moving to next page
    while num < 24 :
        user.send_keys(num)  # send the hour to the input
        driver.find_element_by_id('j_idt206:goster').click()  # apply the hour
        nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']")  # next page for that hour
        nextpage.click()  # move to the next page
        user = driver.find_element_by_name("j_idt206:txt1")  # locate the hour input again
        time.sleep(1)
        tablo_cek()
        num = num + 1  # increment the hour
        user.clear()  # reset the hour input
else:
    print("Güncelleme gelmedi")

Output:

True
[    Fiyat (TL/MWh)  Talep (MWh)  Arz (MWh)
0                0     25.11370   17.70810
1                1     25.07769   17.71210
2                2     25.07767   17.72310
3               85     25.07657   17.72312
4               86     25.07605   17.74612
..             ...          ...        ...
91           10000     23.97000   17.97907
92           10001     23.91500   17.97907
93           10014     23.91500   17.97907
94           10015     23.91500   17.97907
95           10100     23.91499   17.97909

[96 rows x 3 columns], [...]]
[    Fiyat (TL/MWh)  Talep (MWh)  Arz (MWh)
0            10101     23.91499   18.04009
1            10440     23.91497   18.04015
2            10999     23.91493   18.04025
3            11000     23.89993   18.04025
4            11733     23.89988   18.04039
..             ...          ...        ...
91           23999     23.55087   19.40180
92           24000     23.55087   19.40200
93           24001     23.53867   19.40200
94           24221     23.53863   19.40200
95           24222     23.53863   19.40200

[96 rows x 3 columns], [...]]
[    Fiyat (TL/MWh)  Talep (MWh)  Arz (MWh)
0            24360     21.33871    19.8112
1            24499     21.33868    19.8112
2            24500     21.33868    19.8112
3            24574     21.33867    19.8112
4            24575     21.33867    19.8112
..             ...          ...        ...
91           29864     21.18720    20.3708
92           29899     21.18720    20.3708
93           29900     21.18720    20.3808
94           29999     21.18720    20.3808
95           30000     21.18530    20.3811

[96 rows x 3 columns], [...]]
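
As a side note (not part of the original answer): tablo_cek above only prints each table, and dfs.append(dfs) appends the list to itself rather than accumulating results. If you want to end up with a single table (or the Excel file the question asks about), one option is to collect each page's DataFrame in a list and combine them with pd.concat afterwards. A minimal sketch, assuming the first table that pd.read_html finds is the supply/demand table and that driver is the webdriver created above:

import pandas as pd
from bs4 import BeautifulSoup

all_pages = []  # one DataFrame per pulled page (this name is an assumption, not from the original code)

def tablo_cek():
    source = BeautifulSoup(driver.page_source, "lxml")  # current html, as in the answer above
    tablo = source.find_all("table")                    # the table on the page
    dfs = pd.read_html(str(tablo))                      # pd.read_html returns a list of DataFrames
    all_pages.append(dfs[0])                            # assumption: the first table is the one we want
    return dfs[0]

# after the while loop has finished:
# full_table = pd.concat(all_pages, ignore_index=True)
# full_table.to_excel("arz_talep.xlsx", index=False)    # the Excel file the question mentions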

I will also offer another solution, since you can pull the data directly from the request. It also gives you the option of how many rows to pull per page (and you could iterate through each page), but if you set that limit high enough you can get everything in one request. Since there are roughly 400+ rows, I set the limit to 1000, so you only need page 0:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}

page = '0'
payload = {
'javax.faces.partial.ajax': 'true',
'javax.faces.source': 'j_idt206:dt',
'javax.faces.partial.execute': 'j_idt206:dt',
'javax.faces.partial.render': 'j_idt206:dt',
'j_idt206:dt': 'j_idt206:dt',
'j_idt206:dt_pagination': 'true',
'j_idt206:dt_first': page,
'j_idt206:dt_rows': '1000',
'j_idt206:dt_skipChildren': 'true',
'j_idt206:dt_encodeFeature': 'true',
'j_idt206': 'j_idt206',
'j_idt206:date1_input': '04.02.2021',
'j_idt206:txt1': '0',
'j_idt206:dt_rppDD': '1000'
}

rows = []
hours = list(range(0,24))
for hour in hours:
    payload.update({'j_idt206:txt1':str(hour)})
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text.replace('![CDATA[',''), 'lxml')
    columns = ['Fiyat (TL/MWh)',    'Talep (MWh)',  'Arz (MWh)', 'hour']
    
    trs = soup.find_all('tr')
    for row in trs:
        data = row.find_all('td')
        data = [x.text for x in data] + [str(hour)]
        rows.append(data)

df = pd.DataFrame(rows, columns=columns)

Output:

print(df)
    Fiyat (TL/MWh) Talep (MWh)  Arz (MWh)
0             0,00   25.113,70  17.708,10
1             0,01   25.077,69  17.712,10
2             0,02   25.077,67  17.723,10
3             0,85   25.076,57  17.723,12
4             0,86   25.076,05  17.746,12
..             ...         ...        ...
448         571,01   19.317,10  29.529,60
449         571,80   19.316,86  29.529,60
450         571,90   19.316,83  29.529,70
451         571,99   19.316,80  29.529,70
452         572,00   19.316,80  29.540,70

[453 rows x 3 columns]
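
One thing worth noting with this approach (not covered in the original answer): the values come back as text in Turkish number format, with a dot as the thousands separator and a comma as the decimal separator (e.g. 25.113,70). If you need actual numbers, you could convert the columns after building df; a minimal sketch, using the column names shown above:

for col in ['Fiyat (TL/MWh)', 'Talep (MWh)', 'Arz (MWh)']:
    # "25.113,70" -> "25113,70" -> "25113.70" -> 25113.70
    df[col] = (df[col].str.replace('.', '', regex=False)
                      .str.replace(',', '.', regex=False)
                      .astype(float))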

Finding this just takes a little investigative work. If you go to Dev Tools -> Network -> XHR, check whether the data is embedded in one of those requests (see the image). If you find it there, go to the Headers tab, where you can get the url and the parameters at the bottom.

In most cases you will see the data come back in a nice json format. That is not the case here: it comes back in a slightly different, xml-like format, so it takes a bit of extra work to pull out the tags and such. But it is not impossible.

(screenshot: Dev Tools -> Network -> XHR showing the request that returns the table data)
