I have an NBA game data set that I've been building. It covers the 2012-13 through 2016-17 seasons, with playoffs labeled separately as 2013-2017 Playoffs, and has a little over 6,000 rows. It looks like this:
                   Date                 Visitor  V_PTS                  Home  \
25  2012-11-03 19:00:00        Sacramento Kings     98        Indiana Pacers
26  2012-11-03 19:00:00    New Orleans Pelicans     89         Chicago Bulls
27  2012-11-03 19:00:00          Boston Celtics     89    Washington Wizards
28  2012-11-03 19:00:00  Portland Trail Blazers     95       Houston Rockets
29  2012-11-03 19:30:00         Toronto Raptors    100         Brooklyn Nets
30  2012-11-03 19:30:00       Charlotte Hornets     99      Dallas Mavericks
31  2012-11-03 19:30:00   Golden State Warriors    114  Los Angeles Clippers

    H_PTS  Attendance                     Arena                 Location  \
25    106       18165   Bankers Life Fieldhouse    Indianapolis, Indiana
26     82       21758             United Center        Chicago, Illinois
27     86       20308         Capital One Arena         Washington, D.C.
28     85       18140             Toyota Center           Houston, Texas
29    107       17732           Barclays Center       Brooklyn, New York
30    126       19490  American Airlines Center            Dallas, Texas
31    110       19060            Staples Center  Los Angeles, California

    Capacity  Yr Arena Opened   Season  H_Allstars  V_Allstars  V_wins  \
25     17923             1999  2012-13           1           0       0
26     20917             1994  2012-13           2           0       1
27     20356             1997  2012-13           0           2       0
28     18055             2003  2012-13           1           1       1
29     17732             2012  2012-13           1           0       0
30     19200             2001  2012-13           0           0       1
31     19060             1999  2012-13           2           1       1

    V_losses  H_wins  H_losses  V_WPercent  H_WPercent
25         2       1         1         0.0         0.5
26         1       2         0         0.5         1.0
27         2       0         1         0.0         0.0
28         1       2         0         0.5         1.0
29         1       0         0         0.0         0.0
30         0       1         1         1.0         0.5
31         1       2         0         0.5         1.0
I'm not trying to do anything too intense, but I am trying to find what influences NBA attendance and to predict it for each team with a reasonable amount of accuracy. Are there any other predictor variables you'd recommend adding? How would you suggest exploring the data to gain insight, and which ML packages, if any, might be useful? This is my very first personal project, so any and all advice or examples are greatly appreciated.
Update:
After some quick exploration and a few step-by-step online tutorials, I came up with this correlation matrix against the percentage of capacity filled. I may be wrong, but do these numbers seem low? Is attendance too team-specific, so that I should focus on teams that don't consistently sell out? Or is there some variable I should think about including that captures past attendance history?
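For reference, one way the capacity-percentage correlations and a "past attendance history" variable could be computed is sketched below. This is only an illustration built on the sample rows shown above; the names `Pct_Filled`, `Prior_Home_Pct`, and `add_attendance_features` are assumptions, not columns or functions from the original data set.

```python
import pandas as pd

# Seven sample rows copied from the printout above (subset of columns).
games = pd.DataFrame({
    "Date": pd.to_datetime(["2012-11-03 19:00"] * 4 + ["2012-11-03 19:30"] * 3),
    "Home": ["Indiana Pacers", "Chicago Bulls", "Washington Wizards",
             "Houston Rockets", "Brooklyn Nets", "Dallas Mavericks",
             "Los Angeles Clippers"],
    "Attendance": [18165, 21758, 20308, 18140, 17732, 19490, 19060],
    "Capacity": [17923, 20917, 20356, 18055, 17732, 19200, 19060],
    "H_Allstars": [1, 2, 0, 1, 1, 0, 2],
    "V_Allstars": [0, 0, 2, 1, 0, 0, 1],
    "H_WPercent": [0.5, 1.0, 0.0, 1.0, 0.0, 0.5, 1.0],
    "V_WPercent": [0.0, 0.5, 0.0, 0.5, 0.0, 1.0, 0.5],
})


def add_attendance_features(df):
    """Add capacity percentage and the home team's prior average (hypothetical names)."""
    out = df.sort_values("Date").copy()
    out["Pct_Filled"] = out["Attendance"] / out["Capacity"]
    # Average of the home team's EARLIER home games only: shift(1) keeps the
    # current game out of its own feature, so no information leaks from the target.
    out["Prior_Home_Pct"] = out.groupby("Home")["Pct_Filled"].transform(
        lambda s: s.shift(1).expanding().mean()
    )
    return out


features = add_attendance_features(games)
# Correlation of every numeric column with the percentage of capacity filled.
corr = features.select_dtypes("number").corr()["Pct_Filled"]
print(corr.sort_values(ascending=False))
```

In this seven-row sample every home team appears once, so `Prior_Home_Pct` is all NaN here; on the full table it would fill in from each team's second home game onward. If teams that sell out nearly every night dominate the data, `Pct_Filled` barely varies for them, which is one plausible reason for low correlations.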
A few features I would consider adding:
Home and visitor superstars. I know you have All-Stars, but superstars are a bit different. Example: LeBron James vs. DeMar DeRozan. Both are All-Stars, but many more people are likely to go see LeBron than DeMar. It would, however, be quite a pain to add all of this and to decide who counts as a superstar.
You may want to consider rivalries. Boston vs. Los Angeles (Lakers) is always a sellout crowd because of the history between the teams.
The number of season ticket holders may or may not affect the count. I'd be interested to see if that held any weight.
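The rivalry and superstar ideas above could be encoded as simple 0/1 columns. The sketch below assumes hand-maintained lookups; `RIVALRIES`, `SUPERSTAR_TEAMS`, and `add_matchup_flags` are hypothetical names, and which pairings or players belong in them is a judgment call rather than something the data provides.

```python
import pandas as pd

# Hypothetical, hand-maintained lookups (assumptions, not part of the data set).
# frozenset makes the pairing order-independent: Boston at L.A. == L.A. at Boston.
RIVALRIES = {frozenset({"Boston Celtics", "Los Angeles Lakers"})}
# Season -> teams judged to have a superstar on the roster that season.
SUPERSTAR_TEAMS = {"2012-13": {"Miami Heat"}}


def add_matchup_flags(df, rivalries, star_teams):
    """Return a copy of df with Rivalry, H_Superstar and V_Superstar 0/1 columns."""
    out = df.copy()
    out["Rivalry"] = [
        int(frozenset(pair) in rivalries)
        for pair in zip(out["Home"], out["Visitor"])
    ]
    for side, col in (("Home", "H_Superstar"), ("Visitor", "V_Superstar")):
        out[col] = [
            int(team in star_teams.get(season, set()))
            for team, season in zip(out[side], out["Season"])
        ]
    return out


# Usage on the full table (assumed to be loaded as `games`):
# games = add_matchup_flags(games, RIVALRIES, SUPERSTAR_TEAMS)
```

Keying the superstar lookup by season matters because rosters change from year to year; a season-ticket-holder count, if it can be found, would simply be one more numeric column per team and season.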
Some of your existing columns aren't too important. I would consider getting rid of the game score, since that is only known at the end of the game, while attendance is driven by ticket sales before the game starts. It looks like you have a decent amount of data to work with. Since you have a known target (attendance), this is going to be a supervised model, and because that target is numeric, regression is what you will want to use.
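As a rough starting point, a regression baseline that uses only pre-game columns might look like the sketch below. It fits ordinary least squares with NumPy to keep dependencies small; a library such as scikit-learn would be the more usual choice. The names `PREGAME` and `fit_baseline` are assumptions for illustration, and the score columns `V_PTS` and `H_PTS` are deliberately left out.

```python
import numpy as np
import pandas as pd

# Columns known before tip-off; V_PTS and H_PTS are excluded on purpose.
PREGAME = ["Capacity", "H_Allstars", "V_Allstars", "H_WPercent", "V_WPercent"]


def fit_baseline(df, feature_cols, target="Attendance", train_frac=0.8):
    """Fit OLS on the earliest games and report mean absolute error on the rest."""
    df = df.sort_values("Date")
    cut = int(len(df) * train_frac)
    # Chronological split: train on earlier games, evaluate on later ones.
    train, test = df.iloc[:cut], df.iloc[cut:]

    def design(frame):
        X = frame[feature_cols].to_numpy(dtype=float)
        return np.column_stack([np.ones(len(X)), X])  # prepend intercept column

    coef, *_ = np.linalg.lstsq(
        design(train), train[target].to_numpy(dtype=float), rcond=None
    )
    pred = design(test) @ coef
    mae = float(np.mean(np.abs(pred - test[target].to_numpy(dtype=float))))
    return coef, mae


# Usage on the full table (assumed to be loaded as `games`):
# coef, mae = fit_baseline(games, PREGAME)
```

The split is by date rather than random so that the error estimate reflects predicting future games from past ones. Comparing `mae` against a naive guess, such as each arena's capacity, shows whether the added features are pulling their weight.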