Optimizing my Python Scraper

This is a somewhat long-winded question, and I probably just need someone to point me in the right direction. I'm building a web scraper to grab basketball player info from ESPN's website. The URL structure is simple: each player card has a specific ID in the URL. To obtain the information, I loop over IDs from 1 to roughly 6000 to pull players from their database. Is there a more efficient way of doing this?

from bs4 import BeautifulSoup
import requests

player_id = []  # IDs of the players whose page listed an age
age = []  # Player ages, in the same order as player_id

BASE = 'http://espn.go.com/nba/player/stats/_/id/'  # Base structure of the player card URL

def get_age(base):
    for i in range(1, 6000):  # Try every candidate player ID
        url = base + str(i) + '/'  # Create the URL for this player
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        # Prior to this step, I had to print out the soup object and look through the HTML to find the tag that contained my desired information
        age_tables = soup.find_all('ul', class_="player-metadata")  # Grabs everything in the metadata tag
        p = str(age_tables)  # Turns the matched tags into a string
        # At this point I had to look at all the text in p and determine a way to capture the age info
        if "Age: " not in p:  # Player ID doesn't exist, so skip to the next one
            continue
        start = p.index("Age: ") + len("Age: ")  # Position just past the "Age: " label
        end = p.index(")", start)  # Closing parenthesis that follows the age
        player_id.append(i)  # Adds the ID to the player_id list
        age.append(p[start:end])  # Adds the player's age to the age list

get_age(BASE)

Any help, even small, would be much appreciated, even if it just points me in the right direction rather than giving a direct solution.

Thanks, Ben

Just as with a port scanner in network security, multithreading would speed up your program considerably.
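As a rough sketch of the multithreading idea, the loop from the question could hand its requests to a thread pool so that many downloads wait on the network at the same time. The helper names (parse_age, fetch_age, get_ages) and the pool size of 20 are illustrative assumptions, not part of the original code, and urllib.request is used only to keep the sketch dependency-free; requests.get would slot in the same way.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

BASE = 'http://espn.go.com/nba/player/stats/_/id/'  # URL from the question


def parse_age(html):
    """Return the text between 'Age: ' and the next ')', or None if absent."""
    marker = "Age: "
    if marker not in html:
        return None
    start = html.index(marker) + len(marker)
    end = html.find(")", start)
    return html[start:end] if end != -1 else None


def fetch_age(player_id):
    """Download one player page and extract the age (hypothetical helper)."""
    try:
        with urlopen(BASE + str(player_id) + '/', timeout=10) as resp:
            html = resp.read().decode('utf-8', errors='replace')
    except OSError:  # covers URLError, HTTPError, and timeouts
        return player_id, None
    return player_id, parse_age(html)


def get_ages(ids, fetch=fetch_age, max_workers=20):
    """Run fetch over ids on a thread pool; keep only IDs that yielded an age."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch, ids)  # results come back in input order
        return {pid: age for pid, age in results if age is not None}
```

Calling get_ages(range(1, 6000)) would then cover the same ID range as the original loop. Threads suit this job because each worker spends most of its time waiting on the network; a modest pool size also keeps the load on the remote server reasonable.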

A more efficient, and also more organized and scalable, approach would be to switch to the Scrapy web-scraping framework.

The main performance problem comes from the "blocking" nature of your current approach: each request has to finish before the next one starts. Scrapy solves this out of the box because it is based on Twisted and is completely asynchronous.
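To make the blocking versus asynchronous distinction concrete, here is a small sketch that uses the standard library's asyncio as a stand-in. It is not Scrapy or Twisted code, and fake_fetch is a hypothetical placeholder that simulates network latency with a sleep rather than making a real request.

```python
import asyncio
import time


async def fake_fetch(player_id, delay=0.1):
    """Stand-in for a network request: waits without blocking other tasks."""
    await asyncio.sleep(delay)
    return player_id


async def fetch_sequentially(ids):
    # Each request waits for the previous one, like the loop in the question.
    return [await fake_fetch(i) for i in ids]


async def fetch_concurrently(ids):
    # All requests are in flight at once, so the waits overlap.
    return await asyncio.gather(*(fake_fetch(i) for i in ids))


def timed(coro_func, ids):
    """Run one of the coroutines above and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_func(ids))
    return result, time.perf_counter() - start
```

Timing both variants over the same IDs with timed should show the sequential version taking about one delay per ID, while the concurrent version takes about one delay in total. An asynchronous framework applies the same overlap to real HTTP requests.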

I'd probably start with http://espn.go.com/nba/players and use the following regular expression to get the team roster URLs...

href="(/nba/teams/roster\?team=[^"]+)">([^<]+)</a>

Then I'd get the resulting match groups, where \1 is the last portion of the URL and \2 is the team name. Then I'd use those URLs to scrape each team roster page looking for player URLs...

href="(http://espn.go.com/nba/player/_/id/[^"]+)">([^<]+)</a>

I'd finally get the resulting match groups, where \1 is the URL for the player page and \2 is the player name. I'd scrape each resulting URL for the info I needed.
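The two steps above can be sketched with Python's re module. The patterns are the ones given above, written as raw strings without any delimiter characters. The HTML fragments, the team code, and the player shown are made up for illustration, and prefixing the relative roster path with http://espn.go.com is an assumption about how the site resolves its links.

```python
import re

# Patterns from the answer, written as Python raw strings.
TEAM_RE = re.compile(r'href="(/nba/teams/roster\?team=[^"]+)">([^<]+)</a>')
PLAYER_RE = re.compile(r'href="(http://espn.go.com/nba/player/_/id/[^"]+)">([^<]+)</a>')

# Hypothetical fragments standing in for the downloaded pages.
players_page = '<li><a href="/nba/teams/roster?team=bos">Boston Celtics</a></li>'
roster_page = ('<td><a href="http://espn.go.com/nba/player/_/id/123/example-player">'
               'Example Player</a></td>')

# Step 1: group 1 is the relative roster path, group 2 the team name.
teams = [('http://espn.go.com' + path, name)
         for path, name in TEAM_RE.findall(players_page)]

# Step 2: group 1 is the full player URL, group 2 the player name.
players = PLAYER_RE.findall(roster_page)

for url, name in players:
    print(name, url)
```

Because findall returns one tuple per match with the groups in order, each roster page yields a ready-made list of (URL, name) pairs. This means only pages for players who actually exist are requested, instead of probing thousands of IDs.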

Regular expressions are the bomb.

Hope this helps.
