
web scraping taking too long and no output in python

I am currently trying to scrape power plant data. My code is shown below:

#Import packages
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import csv


#For loop to scrape details of power plants
gas_lst=[]

for i in range(1,46624):
    pid=str(i)
    url="http://www.globalenergyobservatory.com/form.php?pid=" + pid
    page=urllib.request.urlopen(url)
    soup=BeautifulSoup(page,'html.parser')

    #Distinguish power plants to different types of primary fuel
    types=soup.find(id="Type")
    power_types=types["value"]

    ###Breakdown of different units
    if power_types=="Gas":

        i = 1

        while True:
            if soup.find(id="unitcheck" + str(i)) == None:
                break
            else:
                gas_unit=soup.find(id="unitcheck" + str(i))
                gas_unit_values=gas_unit["value"]
                gas_capacity=soup.find(id="Capacity_(MWe)_nbr_" + str(i))
                gas_capacity_values=gas_capacity["value"]
                gas_commissioned=soup.find(id="Date_Commissioned_dt_" + str(i))
                gas_commissioned_date=gas_commissioned["value"]
                gas_decommissioned=soup.find(id="Decommission_Date_dt_" + str(i))
                gas_decommissioned_date=gas_decommissioned["value"]
                gas_HRSG=soup.find(id="Boiler/HRSG_Manufacturer_" + str(i))
                gas_HRSG_OEM=gas_HRSG["value"]
                gas_turbine=soup.find(id="Turbine_Manufacturer_" + str(i))
                gas_turbine_OEM=gas_turbine["value"]
                gas_generator=soup.find(id="Generator_Manufacturer_" + str(i))
                gas_generator_OEM=gas_generator["value"]

        i = i+1

    else:
        continue


    #Gas units breakdowns
    gas_lst.append([gas_unit_values,gas_capacity_values,gas_commissioned_date,gas_decommissioned_date,gas_HRSG_OEM,gas_turbine_OEM,gas_generator_OEM])
    gas_df=pd.DataFrame(gas_lst)
    gas_df.columns=['Unit','Capacity','Date_commissioned','Date_decommissioned','HRSG_manufacturer','Turbine_manufacturer','Generator_manufacturer']

    print(pid)


    #Convert to csv file
    gas_df.to_csv('gas_units_breakdowns.csv',index=False) 

However, the process is taking too long and there doesn't seem to be any output at all. Is it because my code is wrong? Any help is much appreciated.

You'll have better (and faster) results if you go straight for the gas type plants, as opposed to checking EVERY plant and then seeing if it's Gas or not.

You can get the list of gas plants by using these parameters:

payload = {
'op': 'menu_name',
'type': '2',
'cnt': '226',
'st': '0'}

This cuts the search down from 46,000+ requests to the 1384 plants that are Gas, and it also gets rid of the if power_types == "Gas" check, since we already know everything we have is "Gas".

Your code also seems to get stuck in your while True loop. That is because you're incrementing i outside of that loop, so I fixed that.
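In isolation, the fix looks like the sketch below: the counter has to move inside the while body so the break condition can eventually be hit.

i = 1
while True:
    # stop once the next unit id no longer exists on the page
    if soup.find(id="unitcheck" + str(i)) is None:
        break
    # ... scrape the fields for unit i ...
    i = i + 1  # increment INSIDE the loop, otherwise i stays at 1 forever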

Lastly, I didn't check the data you scrape in there (i.e. gas_unit_values, gas_capacity, etc.). It does look like some of those values are null, so I'll leave that for you to work through, as this code should at least get you over this first hurdle.
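As a rough sketch of how you might deal with those nulls, one option is a small helper that falls back to an empty string when an element (or its value attribute) is missing; the name safe_value is just for illustration and not part of the original code:

def safe_value(soup, element_id):
    # return the element's "value" attribute, or "" if the element is missing
    element = soup.find(id=element_id)
    if element is None:
        return ""
    return element.get("value", "")

# example usage inside the unit loop
gas_capacity_values = safe_value(soup, "Capacity_(MWe)_nbr_" + str(i))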

#Import packages
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'http://www.globalenergyobservatory.com/geoutility.php'

payload = {
'op': 'menu_name',
'type': '2',
'cnt': '226',
'st': '0'}

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'}

response = requests.get(url, headers=headers, params=payload)

soup = BeautifulSoup(response.text, 'html.parser')

links = []
gas_links = soup.find_all('a')
for ele in gas_links:
    link = ele['href']
    links.append(link)


gas_df = pd.DataFrame()
for pid in links:

    url="http://www.globalenergyobservatory.com" + pid
    page=requests.get(url)
    soup=BeautifulSoup(page.text,'html.parser')

    #Distinguish power plants to different types of primary fuel
    types=soup.find(id="Type")
    power_types=types["value"]

    ###Breakdown of different units
    i = 1
    while True:
        if soup.find(id="unitcheck" + str(i)) == None:
            break
        else:
            gas_unit=soup.find(id="unitcheck" + str(i))
            gas_unit_values=gas_unit["value"]
            gas_capacity=soup.find(id="Capacity_(MWe)_nbr_" + str(i))
            gas_capacity_values=gas_capacity["value"]
            gas_commissioned=soup.find(id="Date_Commissioned_dt_" + str(i))
            gas_commissioned_date=gas_commissioned["value"]
            gas_decommissioned=soup.find(id="Decommission_Date_dt_" + str(i))
            gas_decommissioned_date=gas_decommissioned["value"]
            gas_HRSG=soup.find(id="Boiler/HRSG_Manufacturer_" + str(i))
            gas_HRSG_OEM=gas_HRSG["value"]
            gas_turbine=soup.find(id="Turbine_Manufacturer_" + str(i))
            gas_turbine_OEM=gas_turbine["value"]
            gas_generator=soup.find(id="Generator_Manufacturer_" + str(i))
            gas_generator_OEM=gas_generator["value"]

            temp_df = pd.DataFrame([[gas_unit_values,gas_capacity_values,gas_commissioned_date,gas_decommissioned_date,gas_HRSG_OEM,gas_turbine_OEM,gas_generator_OEM]],
                                   columns = ['Unit','Capacity','Date_commissioned','Date_decommissioned','HRSG_manufacturer','Turbine_manufacturer','Generator_manufacturer'])


            # pd.concat replaces DataFrame.append, which was removed in newer pandas versions
            gas_df = pd.concat([gas_df, temp_df], ignore_index=True)

        i = i+1
    print('%s of %s complete: pid: %s' %(links.index(pid)+1, len(links), pid.split('=')[-1]))



#Gas units breakdowns
gas_df = gas_df.reset_index(drop=True)


#Convert to csv file
gas_df.to_csv('C:/gas_units_breakdowns.csv',index=False) 

I'd recommend multiprocessing. Your machine is essentially sitting idle while it waits for the server to respond to each request. Depending on what server I'm scraping, I can see 10x-20x speedups by utilizing multiprocessing.

First, I'd convert your loop into a function which takes a url as a parameter and returns: [gas_unit_values,gas_capacity_values,gas_commissioned_date,gas_decommissioned_date,gas_HRSG_OEM,gas_turbine_OEM,gas_generator_OEM] .

Here's an outline of what this might look like:

import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import csv
from multiprocessing.dummy import Pool

def scrape_gas_data(url):
    # your main code here
    return [gas_unit_values,gas_capacity_values,gas_commissioned_date,gas_decommissioned_date,gas_HRSG_OEM,gas_turbine_OEM,gas_generator_OEM]

url_list = ["http://www.globalenergyobservatory.com/form.php?pid={}".format(i) for i in range(1,46624)]

# Since http requests can sit idle for some time, you might be able to get away
# with passing a large number to pool (say 50) even though your machine probably
# can't run 50 threads at once
my_pool = Pool()
my_pool.map(scrape_gas_data, url_list)
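If it helps, here is a sketch of how the return value of that map call could be collected into a DataFrame and written out, assuming scrape_gas_data returns one row per plant (or None for pages you decide to skip):

results = my_pool.map(scrape_gas_data, url_list)
my_pool.close()
my_pool.join()

# drop pages that returned nothing, then build the DataFrame
rows = [row for row in results if row is not None]
gas_df = pd.DataFrame(rows, columns=['Unit','Capacity','Date_commissioned','Date_decommissioned',
                                     'HRSG_manufacturer','Turbine_manufacturer','Generator_manufacturer'])
gas_df.to_csv('gas_units_breakdowns.csv', index=False)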

The BeautifulSoup documentation mentions that the lxml parser is faster than html.parser. I'm not sure that's the rate-limiting step here, but since changing the parser is usually low-hanging fruit, I'll mention it as well.
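The change is just the parser argument (lxml has to be installed separately, e.g. with pip install lxml):

# use the faster lxml parser instead of the built-in html.parser
soup = BeautifulSoup(response.text, 'lxml')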

Also, just as a note on good practice, you're re-assigning the variable i inside the loop (it's both your outer pid loop variable and your inner unit counter), which isn't so clean.
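A minimal way to avoid that is to give the inner counter its own name (unit_idx below is just an illustrative name, not from the original code):

unit_idx = 1
while soup.find(id="unitcheck" + str(unit_idx)) is not None:
    # ... scrape the fields for this unit ...
    unit_idx += 1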
