Get Dynamic Tabular Data from a Website Using Selenium & Beautiful Soup
How to pull actual data from multiple pages of a website using Selenium, Beautiful Soup, and pandas?
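I am new to extracting data with Python. I want to build an Excel file from a table pulled off a website.
The website URL: "https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml"
On this page, the table for each hour's data is spread over separate pages. One hour contains roughly 500 rows, so the table is paginated.
I want to extract all the data for every hour. My problem is that the same table gets pulled even though the page changes.
I am using the Beautiful Soup, pandas, and Selenium libraries. Here is my code, to show what I mean.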
import requests
r = requests.get('https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml')
from bs4 import BeautifulSoup
source = BeautifulSoup(r.content,"lxml")
metin =source.title.get_text()
source.find("input",attrs={"id":"j_idt206:txt1"})
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
tarih = source.find("input",attrs={"id":"j_idt206:date1_input"})["value"]
import datetime
import time
x = datetime.datetime.now()
today = datetime.date.today()
# print(today)
tomorrow = today + datetime.timedelta(days = 1)
tomorrow = str(tomorrow)
words = tarih.split('.')
yeni_tarih = '.'.join(reversed(words))
yeni_tarih =yeni_tarih.replace(".","-")

def tablo_cek():
    tablo = source.find_all("table")  # tables on the page
    dfs = pd.read_html(str(tablo))  # read the tables into DataFrames
    dfs.append(dfs)  # append the newly pulled table to the tables
    print(dfs)
    return tablo

if tomorrow == yeni_tarih :
    print(yeni_tarih == tomorrow)
    driver = webdriver.Chrome("C:/Users/tugba.ozkan/AppData/Local/SeleniumBasic/chromedriver.exe")
    driver.get("https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml")
    time.sleep(1)
    driver.find_element_by_xpath("//select/option[@value='96']").click()
    time.sleep(1)
    user = driver.find_element_by_name("j_idt206:txt1")
    nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']")
    num=0
    while num < 24 :
        user.send_keys(num)  # send the hour value
        driver.find_element_by_id('j_idt206:goster').click()  # apply the hour
        nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']")  # next-page button for that hour
        nextpage.click()  # go to the next page
        user = driver.find_element_by_name("j_idt206:txt1")  # re-locate the hour input
        time.sleep(1)
        tablo_cek()
        num = num + 1  # increment the hour by one
        user.clear()  # clear the hour input
else:
    print("Güncelleme gelmedi")  # "No update has arrived"
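In this part:
nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']")  # next-page button for that hour
nextpage.click()
When Python clicks the button to go to the next page, the next page is displayed and the next table should then be pulled, as shown in the picture. But it does not work. In the output I see that the appended tables all contain the same values. This is my output: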
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
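That's because you pull the initial HTML here, source = BeautifulSoup(r.content,"lxml"), and then keep parsing that same content.
You need to pull the HTML for each page you go to. It only takes 1 added line. I commented where I added it: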
import requests
r = requests.get('https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml')
from bs4 import BeautifulSoup
source = BeautifulSoup(r.content,"lxml")
metin =source.title.get_text()
source.find("input",attrs={"id":"j_idt206:txt1"})
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
tarih = source.find("input",attrs={"id":"j_idt206:date1_input"})["value"]
import datetime
import time
x = datetime.datetime.now()
today = datetime.date.today()
# print(today)
tomorrow = today + datetime.timedelta(days = 1)
tomorrow = str(tomorrow)
words = tarih.split('.')
yeni_tarih = '.'.join(reversed(words))
yeni_tarih =yeni_tarih.replace(".","-")

def tablo_cek():
    source = BeautifulSoup(driver.page_source,"lxml") #<-- get the current html
    tablo = source.find_all("table")  # tables on the page
    dfs = pd.read_html(str(tablo))  # read the tables into DataFrames
    dfs.append(dfs)  # append the newly pulled table to the tables
    print(dfs)
    return tablo

if tomorrow == yeni_tarih :
    print(yeni_tarih == tomorrow)
    driver = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")
    driver.get("https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml")
    time.sleep(1)
    driver.find_element_by_xpath("//select/option[@value='96']").click()
    time.sleep(1)
    user = driver.find_element_by_name("j_idt206:txt1")
    nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']")
    num=0
    tablo_cek() #<-- need to get that data before moving to next page
    while num < 24 :
        user.send_keys(num)  # send the hour value
        driver.find_element_by_id('j_idt206:goster').click()  # apply the hour
        nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']")  # next-page button for that hour
        nextpage.click()  # go to the next page
        user = driver.find_element_by_name("j_idt206:txt1")  # re-locate the hour input
        time.sleep(1)
        tablo_cek()
        num = num + 1  # increment the hour by one
        user.clear()  # clear the hour input
else:
    print("Güncelleme gelmedi")  # "No update has arrived"
Output:
True
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.11370 17.70810
1 1 25.07769 17.71210
2 2 25.07767 17.72310
3 85 25.07657 17.72312
4 86 25.07605 17.74612
.. ... ... ...
91 10000 23.97000 17.97907
92 10001 23.91500 17.97907
93 10014 23.91500 17.97907
94 10015 23.91500 17.97907
95 10100 23.91499 17.97909
[96 rows x 3 columns], [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 10101 23.91499 18.04009
1 10440 23.91497 18.04015
2 10999 23.91493 18.04025
3 11000 23.89993 18.04025
4 11733 23.89988 18.04039
.. ... ... ...
91 23999 23.55087 19.40180
92 24000 23.55087 19.40200
93 24001 23.53867 19.40200
94 24221 23.53863 19.40200
95 24222 23.53863 19.40200
[96 rows x 3 columns], [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 24360 21.33871 19.8112
1 24499 21.33868 19.8112
2 24500 21.33868 19.8112
3 24574 21.33867 19.8112
4 24575 21.33867 19.8112
.. ... ... ...
91 29864 21.18720 20.3708
92 29899 21.18720 20.3708
93 29900 21.18720 20.3808
94 29999 21.18720 20.3808
95 30000 21.18530 20.3811
[96 rows x 3 columns], [...]]
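The fix above boils down to re-reading driver.page_source after every navigation instead of reusing one snapshot. Below is a rough sketch of that pattern, which also collects the rows from every page into a single list. (The original dfs.append(dfs) appends the list to itself, which is why each printed list ends in [...].) It uses a stand-in driver so it runs without a browser: FakeDriver, fake_next, rows_from_html and scrape_pages are hypothetical names introduced here, and the standard-library HTML parser stands in for Beautiful Soup and pandas.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, or None outside a row
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            self._in_cell = False
            if self._row:  # skip header rows, which have no <td> cells
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def rows_from_html(html):
    """Return the table rows found in an HTML string as lists of cell text."""
    parser = CellCollector()
    parser.feed(html)
    return parser.rows


def scrape_pages(driver, go_to_next_page, page_count):
    """Re-read driver.page_source on every page and tag each row with its page."""
    all_rows = []
    for page in range(page_count):
        # Parse the HTML the browser holds now, not a snapshot taken earlier.
        for row in rows_from_html(driver.page_source):
            all_rows.append(row + [page])
        if page < page_count - 1:
            go_to_next_page(driver)
    return all_rows


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver; only page_source is modelled."""

    def __init__(self, pages):
        self._pages = pages
        self.index = 0

    @property
    def page_source(self):
        return self._pages[self.index]


def fake_next(driver):
    """Stand-in for clicking the next-page button."""
    driver.index += 1
```

With a real Selenium driver you would pass the driver itself and a function that clicks the next-page control, ideally waiting for the table to refresh before the next read.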
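I'll also offer another solution, since you can pull the data directly from the requests. It also gives you the option of how many rows to pull per page (and you could iterate through each page); however, if you set that limit high enough, you can get everything in 1 request. There are about 400+ rows, so I set the limit to 1000, and then you only need page 0: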
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
page = '0'
payload = {
    'javax.faces.partial.ajax': 'true',
    'javax.faces.source': 'j_idt206:dt',
    'javax.faces.partial.execute': 'j_idt206:dt',
    'javax.faces.partial.render': 'j_idt206:dt',
    'j_idt206:dt': 'j_idt206:dt',
    'j_idt206:dt_pagination': 'true',
    'j_idt206:dt_first': page,
    'j_idt206:dt_rows': '1000',
    'j_idt206:dt_skipChildren': 'true',
    'j_idt206:dt_encodeFeature': 'true',
    'j_idt206': 'j_idt206',
    'j_idt206:date1_input': '04.02.2021',
    'j_idt206:txt1': '0',
    'j_idt206:dt_rppDD': '1000'
}

rows = []
hours = list(range(0,24))
for hour in hours:
    payload.update({'j_idt206:txt1':str(hour)})
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text.replace('![CDATA[',''), 'lxml')
    columns = ['Fiyat (TL/MWh)', 'Talep (MWh)', 'Arz (MWh)', 'hour']
    trs = soup.find_all('tr')
    for row in trs:
        data = row.find_all('td')
        data = [x.text for x in data] + [str(hour)]
        rows.append(data)

df = pd.DataFrame(rows, columns=columns)
Output:
print(df)
Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0,00 25.113,70 17.708,10
1 0,01 25.077,69 17.712,10
2 0,02 25.077,67 17.723,10
3 0,85 25.076,57 17.723,12
4 0,86 25.076,05 17.746,12
.. ... ... ...
448 571,01 19.317,10 29.529,60
449 571,80 19.316,86 29.529,60
450 571,90 19.316,83 29.529,70
451 571,99 19.316,80 29.529,70
452 572,00 19.316,80 29.540,70
[453 rows x 3 columns]
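The cells here come back as strings in Turkish number format, with "." as the thousands separator and "," as the decimal mark. Comparing this output with the Selenium one, where the same first row shows up as 25.11370, suggests that pd.read_html read the separators the other way round; its thousands and decimal arguments may be worth setting to "." and "," there. For the request-based version, a minimal sketch of converting the strings by hand could look like this (parse_tr_number is a hypothetical helper name, not part of the original answer):

```python
from decimal import Decimal


def parse_tr_number(text):
    """Convert a Turkish-formatted number string to a Decimal.

    Assumes '.' is only ever a thousands separator and ',' the decimal mark.
    """
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return Decimal(cleaned)


if __name__ == "__main__":
    for raw in ["0,85", "25.113,70", "17.708,10"]:
        print(raw, "->", parse_tr_number(raw))
```

Decimal keeps the two decimal places exactly; float(cleaned) would also work if exact decimals are not needed.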
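Finding this just takes a little investigative work. If you go to Dev Tools -> Network -> XHR, check whether the data is embedded in those requests (see image). If you find it there, go to the Headers tab, where you can get the URL and parameters at the bottom.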
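Most of the time you'll see the data returned in a nice JSON format. That's not the case here. It comes back in a slightly different, XML-based form, so it takes a little extra work to pull out the tags and so on. But it's not impossible.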
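As an illustration of that extra work, here is a hedged sketch that reads the table rows out of an XML response whose payload sits inside a CDATA section, without stripping the CDATA marker by hand. The SAMPLE string is a made-up response shaped like a JSF partial response, not one captured from the site, and rows_from_partial_response is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real response layout is an assumption.
SAMPLE = (
    "<partial-response><changes>"
    '<update id="j_idt206:dt"><![CDATA['
    "<tr><td>0,00</td><td>25.113,70</td><td>17.708,10</td></tr>"
    "<tr><td>0,01</td><td>25.077,69</td><td>17.712,10</td></tr>"
    "]]></update>"
    '<update id="javax.faces.ViewState"><![CDATA[abc123]]></update>'
    "</changes></partial-response>"
)


def rows_from_partial_response(xml_text, update_id):
    """Return the <td> texts of each <tr> inside the <update> with the given id."""
    root = ET.fromstring(xml_text)
    for update in root.iter("update"):
        if update.get("id") == update_id:
            # The CDATA content arrives as plain text; wrap it in a single
            # root element so it can be parsed as its own small document.
            fragment = ET.fromstring("<table>" + (update.text or "") + "</table>")
            return [
                [(td.text or "").strip() for td in tr.iter("td")]
                for tr in fragment.iter("tr")
            ]
    return []


if __name__ == "__main__":
    for row in rows_from_partial_response(SAMPLE, "j_idt206:dt"):
        print(row)
```

This only works when the fragment inside the CDATA section is well-formed XML. Real HTML fragments often are not, in which case handing update.text to Beautiful Soup is the safer route.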