Do I have enough data to run reliable analysis?

I have an NBA game data set covering the 2012-13 through 2016-17 seasons, with playoffs labeled separately as 2013-2017 Playoffs. I've been building it up to a little over 6,000 rows, and it looks like this:

                  Date                 Visitor  V_PTS                  Home  \
25 2012-11-03 19:00:00        Sacramento Kings     98        Indiana Pacers   
26 2012-11-03 19:00:00    New Orleans Pelicans     89         Chicago Bulls   
27 2012-11-03 19:00:00          Boston Celtics     89    Washington Wizards   
28 2012-11-03 19:00:00  Portland Trail Blazers     95       Houston Rockets   
29 2012-11-03 19:30:00         Toronto Raptors    100         Brooklyn Nets   
30 2012-11-03 19:30:00       Charlotte Hornets     99      Dallas Mavericks   
31 2012-11-03 19:30:00   Golden State Warriors    114  Los Angeles Clippers   

    H_PTS  Attendance                     Arena                 Location  \
25    106       18165   Bankers Life Fieldhouse    Indianapolis, Indiana   
26     82       21758             United Center        Chicago, Illinois   
27     86       20308         Capital One Arena         Washington, D.C.   
28     85       18140             Toyota Center           Houston, Texas   
29    107       17732           Barclays Center       Brooklyn, New York   
30    126       19490  American Airlines Center            Dallas, Texas   
31    110       19060            Staples Center  Los Angeles, California   

    Capacity Yr Arena Opened   Season  H_Allstars  V_Allstars  V_wins  \
25     17923            1999  2012-13           1           0       0   
26     20917            1994  2012-13           2           0       1   
27     20356            1997  2012-13           0           2       0   
28     18055            2003  2012-13           1           1       1   
29     17732            2012  2012-13           1           0       0   
30     19200            2001  2012-13           0           0       1   
31     19060            1999  2012-13           2           1       1   

    V_losses  H_wins  H_losses  V_WPercent  H_WPercent  
25         2       1         1         0.0         0.5  
26         1       2         0         0.5         1.0  
27         2       0         1         0.0         0.0  
28         1       2         0         0.5         1.0  
29         1       0         0         0.0         0.0  
30         0       1         1         1.0         0.5  
31         1       2         0         0.5         1.0 

I'm not trying to do anything too intense, but I am trying to find what influences NBA attendance and to predict it for each team with a reasonable amount of accuracy. Are there any other predictor variables you'd recommend adding? How would you suggest exploring the data to gain insight, and which ML packages, if any, might be useful? This is my very first personal project, so any and all advice or examples are greatly appreciated.

Update:

After some quick exploration and browsing a few step-by-step online tutorials, I produced this correlation matrix for capacity percentage filled. I may be wrong, but do these numbers seem low? Is attendance too team-specific, so that I should focus on teams that don't sell out consistently? Or is there some variable I should think about including that captures past attendance history?

[Image: correlation matrix for capacity percentage filled]
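A minimal sketch of how capacity percentage filled and its correlations could be computed with pandas. It assumes the column names from the sample rows above; the `CapacityPct` name and the seven-row stand-in frame are illustrative only, so the printed numbers say nothing about the full data set.

```python
import pandas as pd

# Tiny stand-in for the full data set; values are copied from the sample rows above.
games = pd.DataFrame({
    "Attendance": [18165, 21758, 20308, 18140, 17732, 19490, 19060],
    "Capacity": [17923, 20917, 20356, 18055, 17732, 19200, 19060],
    "H_Allstars": [1, 2, 0, 1, 1, 0, 2],
    "V_Allstars": [0, 0, 2, 1, 0, 0, 1],
    "H_WPercent": [0.5, 1.0, 0.0, 1.0, 0.0, 0.5, 1.0],
    "V_WPercent": [0.0, 0.5, 0.0, 0.5, 0.0, 1.0, 0.5],
})

# Share of seats filled; values above 1.0 mean attendance exceeded listed capacity.
games["CapacityPct"] = games["Attendance"] / games["Capacity"]

# Pearson correlation of each candidate predictor with CapacityPct.
# Attendance and Capacity are dropped because CapacityPct is built from them.
corr_with_target = (
    games.corr()["CapacityPct"]
    .drop(["CapacityPct", "Attendance", "Capacity"])
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Pearson correlation only measures linear association, so low values for a target that is capped near 100% for sellout teams are not necessarily surprising.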

A few features I would consider adding:

  1. Home and visitor superstars. I know you have All-Stars, but superstars are a bit different. Example: LeBron James vs. DeMar DeRozan. Both are All-Stars, but many more people are likely to go see LeBron than DeMar. It would, however, be quite a pain to add all of this and to decide who counts as a superstar.

  2. You may want to consider rivalries. Boston vs. Los Angeles (Lakers) is always a sellout crowd because of the history between the teams.

  3. The number of season ticket holders may or may not affect the count. I'd be interested to see if that held any weight.
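As a rough illustration of the rivalry idea, a boolean feature could be derived from the `Home` and `Visitor` columns. The `RIVALRIES` set and the `is_rivalry` helper are hypothetical names; only the Celtics-Lakers pair comes from the list above, and the two demo rows are made up for the example.

```python
import pandas as pd

# Hypothetical, hand-maintained set of rivalry pairs.
# frozenset makes the pair order-independent (home/visitor can be swapped).
RIVALRIES = {
    frozenset({"Boston Celtics", "Los Angeles Lakers"}),
}

def is_rivalry(home: str, visitor: str) -> bool:
    """Return True when the two teams form a listed rivalry pair."""
    return frozenset({home, visitor}) in RIVALRIES

# Two made-up rows using the same column names as the data set.
games = pd.DataFrame({
    "Visitor": ["Boston Celtics", "Toronto Raptors"],
    "Home": ["Los Angeles Lakers", "Brooklyn Nets"],
})

# One flag per game, usable as a 0/1 predictor after casting to int.
games["Rivalry"] = [
    is_rivalry(home, visitor)
    for home, visitor in zip(games["Home"], games["Visitor"])
]
print(games)
```

The same pattern would work for a superstar flag, given a hand-picked list of players or teams.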

Some of these aren't too important. I would consider dropping the game score, since it is only known at the end of the game, while attendance is tied to ticket sales made before the game starts. It looks like you have a decent amount of data to work with. Since you have a known target (attendance), this is a supervised learning problem, and because that target is numeric, regression is the kind of model you will want to use.
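A minimal sketch of such a regression baseline, assuming scikit-learn is available and using the column names from the question. The seven rows are copied from the sample output and are far too few to say anything about accuracy; on the full data you would hold out a test set (for example, the most recent season) rather than score on the training rows.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in rows copied from the sample in the question.
games = pd.DataFrame({
    "V_PTS": [98, 89, 89, 95, 100, 99, 114],
    "H_PTS": [106, 82, 86, 85, 107, 126, 110],
    "Attendance": [18165, 21758, 20308, 18140, 17732, 19490, 19060],
    "Capacity": [17923, 20917, 20356, 18055, 17732, 19200, 19060],
    "H_Allstars": [1, 2, 0, 1, 1, 0, 2],
    "V_Allstars": [0, 0, 2, 1, 0, 0, 1],
    "H_WPercent": [0.5, 1.0, 0.0, 1.0, 0.0, 0.5, 1.0],
    "V_WPercent": [0.0, 0.5, 0.0, 0.5, 0.0, 1.0, 0.5],
})

# Final scores are only known after the game, so they are removed
# before modelling attendance.
pregame = games.drop(columns=["V_PTS", "H_PTS"])

feature_cols = ["Capacity", "H_Allstars", "V_Allstars", "H_WPercent", "V_WPercent"]
X = pregame[feature_cols]
y = pregame["Attendance"]

# Ordinary least squares as a first baseline.
model = LinearRegression().fit(X, y)

# One coefficient per feature: estimated change in attendance per unit change.
coefs = pd.Series(model.coef_, index=feature_cols)
print(coefs)
print("R^2 on the training rows:", model.score(X, y))
```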
