
How can I improve the performance (runtime) of my web scraping script (Python and Selenium)?

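I wrote a script to scrape a table from a website: the rosters of all 32 NFL teams over four years. However, the site only shows one team and one year at a time. So my script opens the page, selects a year, scrapes the data, then moves on to the next year, and so on until all four years have been collected. It then repeats the process for the remaining teams.

I'm new to web scraping, so I'm not sure whether, computationally, I'm going about this the best way. Currently, scraping one year of data for one team takes about 40-50 seconds, so roughly 4 minutes per team in total. Collecting all years for all teams takes more than two hours.

Is there a way to scrape the data and reduce the runtime?

The code is below: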

import requests
import lxml.html as lh
import pandas as pd
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Team list
team_ls = ['Arizona Cardinals','Atlanta Falcons','Baltimore Ravens','Buffalo Bills','Carolina Panthers','Chicago Bears','Cincinnati Bengals',
           'Cleveland Browns','Dallas Cowboys','Denver Broncos','Detroit Lions','Green Bay Packers','Houston Texans','Indianapolis Colts',
           'Jacksonville Jaguars','Kansas City Chiefs','Las Vegas Raiders','Los Angeles Chargers','Los Angeles Rams','Miami Dolphins','Minnesota Vikings','New England Patriots',
           'New Orleans Saints','New York Giants','New York Jets','Philadelphia Eagles','Pittsburgh Steelers','San Francisco 49ers','Seattle Seahawks',
           'Tampa Bay Buccaneers','Tennessee Titans','Washington Redskins']

# Format list for URL
team_ls = [team.lower().replace(' ','-') for team in team_ls]

# Changes the year parameter on a given page
def next_year(driver, year_idx):
    
    driver.find_element_by_xpath('//*[@id="main-dropdown"]').click()
    parentElement = driver.find_element_by_xpath('/html/body/app-root/app-nfl/app-roster/div/div/div[2]/div/div/div[1]/div/div/div')
    elementList = parentElement.find_elements_by_tag_name("button")
    elementList[year_idx].click()
    time.sleep(3)

# Create scraping function
def sel_scrape(driver, team, year):
    
    # Get main table
    main_table = driver.find_element_by_tag_name('table')
    
    # Scrape rows and header
    rows = [[td.text.strip() for td in row.find_elements_by_xpath(".//td")] for row in main_table.find_elements_by_xpath(".//tr")][1:]
    header = [[th.text.strip() for th in row.find_elements_by_xpath(".//th")] for row in main_table.find_elements_by_xpath(".//tr")][0]
    
    # compile in dataframe
    df=pd.DataFrame(rows,columns=header)
    
    # Edit data frame
    df['Merge Name'] = df['Name'].str.split(' ',1).str[0].str[0] + '.' + df['Name'].str.split(' ').str[1]
    df['Team'] = team.replace('-',' ').title()
    df['Year'] = year
    
    return df

url='https://www.lineups.com/nfl/roster/'

df = pd.DataFrame()
years = [2020,2019,2018,2017]

start_time = time.time()

for team in team_ls:
    driver = webdriver.Chrome()
    # Generate team link
    driver.get(url+team)
    
    # For each of the four years
    for idx in range(0,4):
        print("Starting {} {}".format(team, years[idx]))
        # Scrape the page
        df = pd.concat([df, sel_scrape(driver, team, years[idx])])
        
        # Change to next year
        next_year(driver, idx)
    driver.close()

print("--- %s seconds ---" % (time.time() - start_time))
    
df.head()

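You can improve this by not using Selenium. Selenium works, but it will naturally run slower because it drives a full browser. The best way to get the data is through the API that serves the data the page renders: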

import pandas as pd
import requests
import time

# Team list
team_ls = ['Arizona Cardinals','Atlanta Falcons','Baltimore Ravens','Buffalo Bills','Carolina Panthers','Chicago Bears','Cincinnati Bengals',
           'Cleveland Browns','Dallas Cowboys','Denver Broncos','Detroit Lions','Green Bay Packers','Houston Texans','Indianapolis Colts',
           'Jacksonville Jaguars','Kansas City Chiefs','Las Vegas Raiders','Los Angeles Chargers','Los Angeles Rams','Miami Dolphins','Minnesota Vikings','New England Patriots',
           'New Orleans Saints','New York Giants','New York Jets','Philadelphia Eagles','Pittsburgh Steelers','San Francisco 49ers','Seattle Seahawks',
           'Tampa Bay Buccaneers','Tennessee Titans','Washington Redskins']


rows = []
start_time = time.time()
for team in team_ls:
    for season in range(2017,2021):
        print('Season: %s\tTeam: %s' %(season, team))
        # The API expects the team name lowercased and hyphenated, e.g. 'arizona-cardinals'
        teamStr = '-'.join(team.split()).lower()
        url = 'https://api.lineups.com/nfl/fetch/roster/{season}/{teamStr}'.format(season=season, teamStr=teamStr)

        # One request per team and season; the roster rows sit under the 'data' key
        jsonData = requests.get(url).json()
        roster = jsonData['data']
        # Tag each player record with its season and team before collecting it
        for item in roster:
            item.update({'Year':season, 'Team':team})
        rows += roster

df = pd.DataFrame(rows)

print("--- %s seconds ---" % (time.time() - start_time))

print(df.head())
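
If the sequential loop above is still slower than you would like, the 128 requests (32 teams x 4 seasons) are independent of each other and can be issued in parallel. The following is only a minimal sketch of that idea, not part of the original answer: the helper names `team_slug`, `fetch_roster` and `collect_rosters`, the `max_workers` value and the 10-second timeout are all assumptions made for illustration. It uses `urllib` from the standard library rather than `requests` so that it runs without extra installs, and it assumes the endpoint keeps returning JSON with the rows under `'data'`, as in the code above.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Same endpoint pattern as in the answer above
API_URL = 'https://api.lineups.com/nfl/fetch/roster/{season}/{team_slug}'


def team_slug(team):
    """Turn 'Arizona Cardinals' into 'arizona-cardinals'."""
    return '-'.join(team.split()).lower()


def fetch_roster(season, team, timeout=10):
    """Download one roster (assumes the JSON keeps its rows under 'data')."""
    url = API_URL.format(season=season, team_slug=team_slug(team))
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)['data']


def collect_rosters(teams, seasons, fetch=fetch_roster, max_workers=8):
    """Fetch every (season, team) pair in parallel and tag each row."""
    jobs = [(season, team) for team in teams for season in seasons]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `jobs`
        rosters = list(pool.map(lambda job: fetch(*job), jobs))

    rows = []
    for (season, team), roster in zip(jobs, rosters):
        for item in roster:
            # Copy the record and add the same 'Year' and 'Team' columns as above
            rows.append({**item, 'Year': season, 'Team': team})
    return rows
```

With the `team_ls` list from above, `df = pd.DataFrame(collect_rosters(team_ls, range(2017, 2021)))` would build the same kind of DataFrame. Threads fit here because the work is dominated by waiting on the network; keeping `max_workers` modest avoids hammering the server. The `fetch` parameter exists so the download step can be swapped out, for example for a `requests.Session`-based function or a stub when testing offline.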

