Web scraping taking too long and no output in Python

I am currently trying to scrape power plant data. My code is attached below:

#Import packages
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import csv


#For loop to scrape details of power plants
gas_lst=[]

for i in range(1,46624):
    pid=str(i)
    url="http://www.globalenergyobservatory.com/form.php?pid=" + pid
    page=urllib.request.urlopen(url)
    soup=BeautifulSoup(page,'html.parser')

    #Distinguish power plants to different types of primary fuel
    types=soup.find(id="Type")
    power_types=types["value"]

    ###Breakdown of different units
    if power_types=="Gas":

        i = 1

        while True:
            if soup.find(id="unitcheck" + str(i)) == None:
                break
            else:
                gas_unit=soup.find(id="unitcheck" + str(i))
                gas_unit_values=gas_unit["value"]
                gas_capacity=soup.find(id="Capacity_(MWe)_nbr_" + str(i))
                gas_capacity_values=gas_capacity["value"]
                gas_commissioned=soup.find(id="Date_Commissioned_dt_" + str(i))
                gas_commissioned_date=gas_commissioned["value"]
                gas_decommissioned=soup.find(id="Decommission_Date_dt_" + str(i))
                gas_decommissioned_date=gas_decommissioned["value"]
                gas_HRSG=soup.find(id="Boiler/HRSG_Manufacturer_" + str(i))
                gas_HRSG_OEM=gas_HRSG["value"]
                gas_turbine=soup.find(id="Turbine_Manufacturer_" + str(i))
                gas_turbine_OEM=gas_turbine["value"]
                gas_generator=soup.find(id="Generator_Manufacturer_" + str(i))
                gas_generator_OEM=gas_generator["value"]

        i = i+1

    else:
        continue


    #Gas units breakdowns
    gas_lst.append([gas_unit_values,gas_capacity_values,gas_commissioned_date,gas_decommissioned_date,gas_HRSG_OEM,gas_turbine_OEM,gas_generator_OEM])
    gas_df=pd.DataFrame(gas_lst)
    gas_df.columns=['Unit','Capacity','Date_commissioned','Date_decommissioned','HRSG_manufacturer','Turbine_manufacturer','Generator_manufacturer']

    print(pid)


    #Convert to csv file
    gas_df.to_csv('gas_units_breakdowns.csv',index=False) 

However, the process is taking too long and there doesn't seem to be any output at all. I wonder whether it's because my code is wrong? Any help is much appreciated.

You'll have better (and faster) results if you go straight for the gas-type plants, as opposed to checking EVERY plant and then seeing whether it's gas or not.

You can get the list of gas plants by using these parameters:

payload = {
'op': 'menu_name',
'type': '2',
'cnt': '226',
'st': '0'}

This cuts things down from having to search through 46,000+ requests to just the 1,384 that are gas, and it gets rid of the if power_types == "Gas" check, since we already know everything in the list is gas.

Your code also seems to get stuck in your while True. That's because you increment i outside of that loop, so I fixed that.
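
In other words, the increment has to live inside the while block. Here's a stripped-down sketch of the corrected control flow (it assumes soup is the parsed page from the surrounding code, and elides the field lookups):

i = 1
while soup.find(id="unitcheck" + str(i)) is not None:
    # ... pull this unit's fields out of soup here ...
    i = i + 1  # increment inside the loop, otherwise i stays at 1 forever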

Lastly, I didn't check the data you're scraping there (i.e. gas_unit_values, gas_capacity, etc.). It does look like some of those values are null, so I'll leave that for you to work through, as this code should at least get you over the first hurdle.
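
If you want a head start on the nulls, a small helper along these lines would avoid crashing on missing fields (a hypothetical sketch, not part of the code below):

def get_value(soup, element_id):
    # Return the tag's value attribute, or '' if the tag or attribute is missing
    tag = soup.find(id=element_id)
    return tag['value'] if tag is not None and tag.has_attr('value') else ''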

#Import packages
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'http://www.globalenergyobservatory.com/geoutility.php'

payload = {
'op': 'menu_name',
'type': '2',
'cnt': '226',
'st': '0'}

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'}

response = requests.get(url, headers=headers, params=payload)

soup = BeautifulSoup(response.text, 'html.parser')

links = []
gas_links = soup.find_all('a')
for ele in gas_links:
    link = ele['href']
    links.append(link)


gas_df = pd.DataFrame()
for pid in links:

    url="http://www.globalenergyobservatory.com" + pid
    page=requests.get(url)
    soup=BeautifulSoup(page.text,'html.parser')

    #Distinguish power plants to different types of primary fuel
    types=soup.find(id="Type")
    power_types=types["value"]

    ###Breakdown of different units
    i = 1
    while True:
        if soup.find(id="unitcheck" + str(i)) is None:
            break
        else:
            gas_unit=soup.find(id="unitcheck" + str(i))
            gas_unit_values=gas_unit["value"]
            gas_capacity=soup.find(id="Capacity_(MWe)_nbr_" + str(i))
            gas_capacity_values=gas_capacity["value"]
            gas_commissioned=soup.find(id="Date_Commissioned_dt_" + str(i))
            gas_commissioned_date=gas_commissioned["value"]
            gas_decommissioned=soup.find(id="Decommission_Date_dt_" + str(i))
            gas_decommissioned_date=gas_decommissioned["value"]
            gas_HRSG=soup.find(id="Boiler/HRSG_Manufacturer_" + str(i))
            gas_HRSG_OEM=gas_HRSG["value"]
            gas_turbine=soup.find(id="Turbine_Manufacturer_" + str(i))
            gas_turbine_OEM=gas_turbine["value"]
            gas_generator=soup.find(id="Generator_Manufacturer_" + str(i))
            gas_generator_OEM=gas_generator["value"]

            temp_df = pd.DataFrame([[gas_unit_values,gas_capacity_values,gas_commissioned_date,gas_decommissioned_date,gas_HRSG_OEM,gas_turbine_OEM,gas_generator_OEM]],
                                   columns = ['Unit','Capacity','Date_commissioned','Date_decommissioned','HRSG_manufacturer','Turbine_manufacturer','Generator_manufacturer'])


            gas_df = pd.concat([gas_df, temp_df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

        i = i+1
    print('%s of %s complete: pid: %s' %(links.index(pid)+1, len(links), pid.split('=')[-1]))



#Gas units breakdowns
gas_df = gas_df.reset_index(drop=True)


#Convert to csv file
gas_df.to_csv('C:/gas_units_breakdowns.csv',index=False) 

I'd recommend multiprocessing. Your machine essentially sits idle while it waits for the server to respond to each request. Depending on the server I'm scraping, I can see 10x-20x speedups by utilizing multiprocessing.

First, I'd convert your loop into a function which takes a url as a parameter and returns [gas_unit_values, gas_capacity_values, gas_commissioned_date, gas_decommissioned_date, gas_HRSG_OEM, gas_turbine_OEM, gas_generator_OEM].

Here's an outline of what this might look like:

import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import csv
from multiprocessing.dummy import Pool

def scrape_gas_data(url):
    # your main code here
    return [gas_unit_values,gas_capacity_values,gas_commissioned_date,gas_decommissioned_date,gas_HRSG_OEM,gas_turbine_OEM,gas_generator_OEM]

url_list = ["http://www.globalenergyobservatory.com/form.php?pid={}".format(i) for i in range(1,46624)]

# Since http requests can sit idle for some time, you might be able to get away
# with passing a large number to pool (say 50) even though your machine probably
# can't run 50 threads at once
my_pool = Pool()
results = my_pool.map(scrape_gas_data, url_list)  # capture the rows, or they're discarded
my_pool.close()
my_pool.join()
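
From there the pooled results can be written out in one go. A sketch, assuming scrape_gas_data returns one seven-element row per gas plant and None for anything it skips (both assumptions, since the function body is elided above):

rows = [r for r in results if r is not None]  # drop skipped plants
gas_df = pd.DataFrame(rows, columns=['Unit','Capacity','Date_commissioned',
                                     'Date_decommissioned','HRSG_manufacturer',
                                     'Turbine_manufacturer','Generator_manufacturer'])
gas_df.to_csv('gas_units_breakdowns.csv', index=False)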

The BeautifulSoup documentation mentions that the lxml parser is faster than html.parser. I'm not sure that's the rate-limiting step here, but since changing the parser is usually low-hanging fruit, I'll mention it as well.
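
It's a one-word change in each BeautifulSoup call (the lxml package needs to be installed first, e.g. pip install lxml):

soup = BeautifulSoup(page, 'lxml')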

Also, just as a note on good practice, you're re-assigning the variable i inside the loop, which isn't so clean.
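
For example, giving the unit counter its own name keeps it from clobbering the outer loop variable (hypothetical name, same logic as your code):

for i in range(1, 46624):
    # ... fetch and parse the page for this pid ...
    unit_no = 1
    while soup.find(id="unitcheck" + str(unit_no)) is not None:
        # ... scrape this unit ...
        unit_no = unit_no + 1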
