How to pull data from multiple pages of a website using Selenium, BeautifulSoup, and Pandas?
I am new to pulling data with Python. I want to build an Excel file from tables pulled from a website.
The website URL: "https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml"
On this webpage, the data for each hour is shown in a table that is split across separate pages, because one hour contains around 500 rows.
I want to pull all the data for every hour, but my mistake is that I keep pulling the same table even when the page changes.
I am using the BeautifulSoup, pandas, and selenium libraries. I will show you my code to explain myself.
import requests
r = requests.get('https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml')
from bs4 import BeautifulSoup
source = BeautifulSoup(r.content,"lxml")
metin =source.title.get_text()
source.find("input",attrs={"id":"j_idt206:txt1"})
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
tarih = source.find("input",attrs={"id":"j_idt206:date1_input"})["value"]
import datetime
import time
x = datetime.datetime.now()
today = datetime.date.today()
# print(today)
tomorrow = today + datetime.timedelta(days = 1)
tomorrow = str(tomorrow)
words = tarih.split('.')
yeni_tarih = '.'.join(reversed(words))
yeni_tarih =yeni_tarih.replace(".","-")
def tablo_cek():
    tablo = source.find_all("table") # tables on the page
    dfs = pd.read_html(str(tablo)) # parse the tables into dataframes
    dfs.append(dfs) # append the newly pulled table
    print(dfs)
    return tablo
if tomorrow == yeni_tarih :
    print(yeni_tarih == tomorrow)
    driver = webdriver.Chrome("C:/Users/tugba.ozkan/AppData/Local/SeleniumBasic/chromedriver.exe")
    driver.get("https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml")
    time.sleep(1)
    driver.find_element_by_xpath("//select/option[@value='96']").click()
    time.sleep(1)
    user = driver.find_element_by_name("j_idt206:txt1")
    nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']")
    num=0
    while num < 24 :
        user.send_keys(num) # send the hour value
        driver.find_element_by_id('j_idt206:goster').click() # apply the hour
        nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']") # next page for that hour
        nextpage.click() # go to the next page
        user = driver.find_element_by_name("j_idt206:txt1") # re-fetch the hour input
        time.sleep(1)
        tablo_cek()
        num = num + 1 # increment the hour
        user.clear() # reset the hour input
else:
    print("Güncelleme gelmedi")
In this situation:
nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']") # next page for that hour
nextpage.click()
When Python clicks the button to go to the next page, the next page is shown, and then it needs to pull the next table as displayed. But it doesn't work: in the output I see the same values appended over and over. Like this, this is my output:
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
That's because you pull the initial html here:
source = BeautifulSoup(r.content,"lxml")
and then keep parsing that same static content. You need to pull the html for each page that you navigate to. It's just a matter of adding one line; I commented where I added it:
import requests
r = requests.get('https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml')
from bs4 import BeautifulSoup
source = BeautifulSoup(r.content,"lxml")
metin =source.title.get_text()
source.find("input",attrs={"id":"j_idt206:txt1"})
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
tarih = source.find("input",attrs={"id":"j_idt206:date1_input"})["value"]
import datetime
import time
x = datetime.datetime.now()
today = datetime.date.today()
# print(today)
tomorrow = today + datetime.timedelta(days = 1)
tomorrow = str(tomorrow)
words = tarih.split('.')
yeni_tarih = '.'.join(reversed(words))
yeni_tarih =yeni_tarih.replace(".","-")
def tablo_cek():
    source = BeautifulSoup(driver.page_source,"lxml") #<-- get the current html
    tablo = source.find_all("table") # tables on the page
    dfs = pd.read_html(str(tablo)) # parse the tables into dataframes
    dfs.append(dfs) # append the newly pulled table
    print(dfs)
    return tablo
if tomorrow == yeni_tarih :
    print(yeni_tarih == tomorrow)
    driver = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")
    driver.get("https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml")
    time.sleep(1)
    driver.find_element_by_xpath("//select/option[@value='96']").click()
    time.sleep(1)
    user = driver.find_element_by_name("j_idt206:txt1")
    nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']")
    num=0
    tablo_cek() #<-- need to get that data before moving to next page
    while num < 24 :
        user.send_keys(num) # send the hour value
        driver.find_element_by_id('j_idt206:goster').click() # apply the hour
        nextpage = driver.find_element_by_xpath("//a/span[@class ='ui-icon ui-icon-seek-next']") # next page for that hour
        nextpage.click() # go to the next page
        user = driver.find_element_by_name("j_idt206:txt1") # re-fetch the hour input
        time.sleep(1)
        tablo_cek()
        num = num + 1 # increment the hour
        user.clear() # reset the hour input
else:
    print("Güncelleme gelmedi")
Output:
True
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.11370 17.70810
1 1 25.07769 17.71210
2 2 25.07767 17.72310
3 85 25.07657 17.72312
4 86 25.07605 17.74612
.. ... ... ...
91 10000 23.97000 17.97907
92 10001 23.91500 17.97907
93 10014 23.91500 17.97907
94 10015 23.91500 17.97907
95 10100 23.91499 17.97909
[96 rows x 3 columns], [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 10101 23.91499 18.04009
1 10440 23.91497 18.04015
2 10999 23.91493 18.04025
3 11000 23.89993 18.04025
4 11733 23.89988 18.04039
.. ... ... ...
91 23999 23.55087 19.40180
92 24000 23.55087 19.40200
93 24001 23.53867 19.40200
94 24221 23.53863 19.40200
95 24222 23.53863 19.40200
[96 rows x 3 columns], [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 24360 21.33871 19.8112
1 24499 21.33868 19.8112
2 24500 21.33868 19.8112
3 24574 21.33867 19.8112
4 24575 21.33867 19.8112
.. ... ... ...
91 29864 21.18720 20.3708
92 29899 21.18720 20.3708
93 29900 21.18720 20.3808
94 29999 21.18720 20.3808
95 30000 21.18530 20.3811
[96 rows x 3 columns], [...]]
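One more side note on the helper above: `dfs.append(dfs)` inside `tablo_cek` appends the list to itself rather than accumulating results across iterations. A cleaner pattern (a sketch only, not tied to the live site; `collect_tables` is a name I made up) is to save each page's html in the loop and concatenate all the parsed frames once at the end:

```python
from io import StringIO

import pandas as pd

def collect_tables(html_pages):
    """Parse every <table> in each html snippet and stack them into one frame."""
    frames = []
    for html in html_pages:
        # read_html returns a list of dataframes, one per <table> found
        frames.extend(pd.read_html(StringIO(html)))
    return pd.concat(frames, ignore_index=True)
```

In the Selenium loop you would append `driver.page_source` to a list on each iteration and call `collect_tables` once after the loop finishes.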
I will also offer up another solution, as you can pull that data directly from the requests. It also gives you the option of how many rows to pull per page (and you can iterate through each page); however, if you set that limit high enough, you can get it all in one request. There are about 400+ rows, so I set the limit to 1000, and then you only need page 0:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
page = '0'
payload = {
    'javax.faces.partial.ajax': 'true',
    'javax.faces.source': 'j_idt206:dt',
    'javax.faces.partial.execute': 'j_idt206:dt',
    'javax.faces.partial.render': 'j_idt206:dt',
    'j_idt206:dt': 'j_idt206:dt',
    'j_idt206:dt_pagination': 'true',
    'j_idt206:dt_first': page,
    'j_idt206:dt_rows': '1000',
    'j_idt206:dt_skipChildren': 'true',
    'j_idt206:dt_encodeFeature': 'true',
    'j_idt206': 'j_idt206',
    'j_idt206:date1_input': '04.02.2021',
    'j_idt206:txt1': '0',
    'j_idt206:dt_rppDD': '1000'
}
rows = []
hours = list(range(0,24))
for hour in hours:
    payload.update({'j_idt206:txt1':str(hour)})
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text.replace('![CDATA[',''), 'lxml')
    columns = ['Fiyat (TL/MWh)', 'Talep (MWh)', 'Arz (MWh)', 'hour']
    trs = soup.find_all('tr')
    for row in trs:
        data = row.find_all('td')
        data = [x.text for x in data] + [str(hour)]
        rows.append(data)
df = pd.DataFrame(rows, columns=columns)
Output:
print(df)
Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0,00 25.113,70 17.708,10
1 0,01 25.077,69 17.712,10
2 0,02 25.077,67 17.723,10
3 0,85 25.076,57 17.723,12
4 0,86 25.076,05 17.746,12
.. ... ... ...
448 571,01 19.317,10 29.529,60
449 571,80 19.316,86 29.529,60
450 571,90 19.316,83 29.529,70
451 571,99 19.316,80 29.529,70
452 572,00 19.316,80 29.540,70
[453 rows x 3 columns]
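Note that with this approach the values come back as Turkish-formatted strings (dot as thousands separator, comma as decimal separator). If you want numeric columns, a small conversion helper (my own addition, not part of the original answer) does the trick:

```python
def tr_to_float(s):
    # "25.113,70" -> 25113.70: strip the thousands dots, then swap the decimal comma
    return float(s.replace('.', '').replace(',', '.'))
```

You would apply it with e.g. `df['Arz (MWh)'] = df['Arz (MWh)'].map(tr_to_float)`.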
Finding this just takes a little investigative work. If you go to Dev Tools -> Network -> XHR, you can check whether the data is embedded somewhere in those requests (see image). If you find it there, go to the Headers tab, and you can get the url and parameters at the bottom.
In MOST cases you'll see the data returned in a nice json format. Not the case here: it was returned in a slightly different way with xml, so it needs a tad of extra work to pull out the tags and such. But not impossible.
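If you would rather keep a smaller per-page limit and iterate the pages instead, paging is just a matter of stepping the `j_idt206:dt_first` offset in the payload. A sketch (the `j_idt206` prefix mirrors the payload above and may change as the site is updated; `page_payloads` is a hypothetical helper):

```python
def page_payloads(base_payload, total_rows, per_page):
    """Build one request payload per page by stepping the dt_first offset."""
    payloads = []
    for first in range(0, total_rows, per_page):
        p = dict(base_payload)  # copy so each page gets its own payload
        p['j_idt206:dt_first'] = str(first)
        p['j_idt206:dt_rows'] = str(per_page)
        payloads.append(p)
    return payloads
```

Each payload would then be passed as `params` to `requests.get`, exactly as in the code above, and the rows from every page appended to the same list.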