Web scraping taking too long and no output in Python
I am currently trying to scrape power plant data. My code is attached below:
#Import packages
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import csv

#For loop to scrape details of power plants
gas_lst=[]
for i in range(1,46624):
    pid=str(i)
    url="http://www.globalenergyobservatory.com/form.php?pid=" + pid
    page=urllib.request.urlopen(url)
    soup=BeautifulSoup(page,'html.parser')
    #Distinguish power plants by type of primary fuel
    types=soup.find(id="Type")
    power_types=types["value"]
    ###Breakdown of different units
    if power_types=="Gas":
        i = 1
        while True:
            if soup.find(id="unitcheck" + str(i)) == None:
                break
            else:
                gas_unit=soup.find(id="unitcheck" + str(i))
                gas_unit_values=gas_unit["value"]
                gas_capacity=soup.find(id="Capacity_(MWe)_nbr_" + str(i))
                gas_capacity_values=gas_capacity["value"]
                gas_commissioned=soup.find(id="Date_Commissioned_dt_" + str(i))
                gas_commissioned_date=gas_commissioned["value"]
                gas_decommissioned=soup.find(id="Decommission_Date_dt_" + str(i))
                gas_decommissioned_date=gas_decommissioned["value"]
                gas_HRSG=soup.find(id="Boiler/HRSG_Manufacturer_" + str(i))
                gas_HRSG_OEM=gas_HRSG["value"]
                gas_turbine=soup.find(id="Turbine_Manufacturer_" + str(i))
                gas_turbine_OEM=gas_turbine["value"]
                gas_generator=soup.find(id="Generator_Manufacturer_" + str(i))
                gas_generator_OEM=gas_generator["value"]
        i = i+1
    else:
        continue
    #Gas units breakdowns
    gas_lst.append([gas_unit_values,gas_capacity_values,gas_commissioned_date,gas_decommissioned_date,gas_HRSG_OEM,gas_turbine_OEM,gas_generator_OEM])
    gas_df=pd.DataFrame(gas_lst)
    gas_df.columns=['Unit','Capacity','Date_commissioned','Date_decommissioned','HRSG_manufacturer','Turbine_manufacturer','Generator_manufacturer']
    print(pid)

#Convert to csv file
gas_df.to_csv('gas_units_breakdowns.csv',index=False)
However, the process takes too long, and there doesn't seem to be any output at all. Is it because my code is wrong? Any help is much appreciated.
You'll get better (faster) results by going directly to the gas-type plants, rather than checking every plant first and then seeing whether it is gas. You can get the list of gas plants by using the following parameters:
payload = {
    'op': 'menu_name',
    'type': '2',
    'cnt': '226',
    'st': '0'}
This cuts things down from having to search through 46,000+ requests to just the 1,384 gas plants, and gets rid of the `if power_types == "Gas"` check, since we already know everything we have is "Gas".
Your code also seems to get stuck in the `while True` loop. That's because you increment `i` outside the loop, so I fixed that.
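A stripped-down illustration of that bug, with made-up unit ids standing in for the page's `unitcheck` elements: if `i = i + 1` sits outside the `while` body, the condition tested against `i` never changes and the loop never ends. Moving the increment inside the body, as below, makes it terminate.

```python
def count_units(page_ids):
    # page_ids stands in for the set of unit ids present on one plant's page
    i = 1
    units = 0
    while True:
        if i not in page_ids:  # stand-in for soup.find(...) returning None
            break
        units += 1
        i = i + 1              # increment *inside* the loop body, so it terminates
    return units

print(count_units({1, 2, 3}))  # 3
```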
Lastly, I didn't check the data you're scraping there (e.g. gas_unit_values, gas_capacity, etc.). It does look like some of those values are empty, so I'll leave that for you to sort out, since this code should at least get you over the first hurdle.
#Import packages
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'http://www.globalenergyobservatory.com/geoutility.php'
payload = {
    'op': 'menu_name',
    'type': '2',
    'cnt': '226',
    'st': '0'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'}

response = requests.get(url, headers=headers, params=payload)
soup = BeautifulSoup(response.text, 'html.parser')

#Collect the link to each gas plant's form page
links = []
gas_links = soup.find_all('a')
for ele in gas_links:
    link = ele['href']
    links.append(link)

gas_df = pd.DataFrame()
for pid in links:
    url = "http://www.globalenergyobservatory.com" + pid
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    #Distinguish power plants by type of primary fuel
    types = soup.find(id="Type")
    power_types = types["value"]
    ###Breakdown of different units
    i = 1
    while True:
        if soup.find(id="unitcheck" + str(i)) == None:
            break
        else:
            gas_unit = soup.find(id="unitcheck" + str(i))
            gas_unit_values = gas_unit["value"]
            gas_capacity = soup.find(id="Capacity_(MWe)_nbr_" + str(i))
            gas_capacity_values = gas_capacity["value"]
            gas_commissioned = soup.find(id="Date_Commissioned_dt_" + str(i))
            gas_commissioned_date = gas_commissioned["value"]
            gas_decommissioned = soup.find(id="Decommission_Date_dt_" + str(i))
            gas_decommissioned_date = gas_decommissioned["value"]
            gas_HRSG = soup.find(id="Boiler/HRSG_Manufacturer_" + str(i))
            gas_HRSG_OEM = gas_HRSG["value"]
            gas_turbine = soup.find(id="Turbine_Manufacturer_" + str(i))
            gas_turbine_OEM = gas_turbine["value"]
            gas_generator = soup.find(id="Generator_Manufacturer_" + str(i))
            gas_generator_OEM = gas_generator["value"]
            temp_df = pd.DataFrame([[gas_unit_values, gas_capacity_values, gas_commissioned_date, gas_decommissioned_date, gas_HRSG_OEM, gas_turbine_OEM, gas_generator_OEM]],
                                   columns=['Unit', 'Capacity', 'Date_commissioned', 'Date_decommissioned', 'HRSG_manufacturer', 'Turbine_manufacturer', 'Generator_manufacturer'])
            #DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            gas_df = pd.concat([gas_df, temp_df])
            i = i + 1
    print('%s of %s complete: pid: %s' % (links.index(pid) + 1, len(links), pid.split('=')[-1]))

#Gas units breakdowns
gas_df = gas_df.reset_index(drop=True)

#Convert to csv file
gas_df.to_csv('C:/gas_units_breakdowns.csv', index=False)
I suggest multiprocessing. Your machine is effectively sitting idle while it waits for the server to respond to each request. Depending on the server I'm scraping, I can see a 10x to 20x speed-up by taking advantage of multiprocessing.

First, convert your loop into a function that takes a url as a parameter and returns [gas_unit_values, gas_capacity_values, gas_commissioned_date, gas_decommissioned_date, gas_HRSG_OEM, gas_turbine_OEM, gas_generator_OEM].

Here's an outline of what that looks like:
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import csv
from multiprocessing.dummy import Pool

def scrape_gas_data(url):
    # your main code here
    return [gas_unit_values, gas_capacity_values, gas_commissioned_date, gas_decommissioned_date, gas_HRSG_OEM, gas_turbine_OEM, gas_generator_OEM]

url_list = ["http://www.globalenergyobservatory.com/form.php?pid={}".format(i) for i in range(1, 46624)]

# Since http requests can sit idle for some time, you might be able to get away
# with passing a large number to Pool (say 50) even though your machine probably
# can't run 50 threads at once
my_pool = Pool()
results = my_pool.map(scrape_gas_data, url_list)  # capture the returned rows
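To make that pattern concrete, here is a minimal, self-contained sketch; the worker and its inputs (`fake_scrape`, `pids`) are stand-ins for `scrape_gas_data` and `url_list`, not part of the original code. `multiprocessing.dummy` provides a *thread* pool behind the `multiprocessing` API, which suits I/O-bound work like waiting on HTTP responses:

```python
from multiprocessing.dummy import Pool  # thread pool, multiprocessing API

def fake_scrape(pid):
    # stand-in for the network-bound request/parse step
    return [pid, pid * 2]

pids = list(range(1, 11))
with Pool(5) as pool:                    # 5 worker threads
    rows = pool.map(fake_scrape, pids)   # blocks until done; preserves input order

print(rows[:2])  # [[1, 2], [2, 4]]
```

`map` hands each item to a free thread and reassembles the results in the original order, so the rows line up with the urls you passed in.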
The BeautifulSoup documentation mentions that the lxml parser is faster than html.parser. I'm not sure whether parsing is the rate-limiting step here, but since changing the parser is usually easy, I'll mention that too.
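Swapping parsers is a one-line change, sketched below. Note that lxml is a third-party package (`pip install lxml`), so this falls back to html.parser if it isn't installed; both parsers expose the same soup API:

```python
from bs4 import BeautifulSoup

html = '<input id="Type" value="Gas">'

# Prefer the faster lxml parser when available, else use the stdlib-backed one
try:
    import lxml  # noqa: F401
    parser = "lxml"
except ImportError:
    parser = "html.parser"

soup = BeautifulSoup(html, parser)
print(soup.find(id="Type")["value"])  # Gas
```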
Also, as a note on best practices, you're reassigning the variable i inside the loop, which isn't very clean.