Do I have enough data to run reliable analysis?
I have been building an NBA game data set covering the 2012-13 through 2016-17 seasons, with playoff games labeled separately as 2013-2017 Playoffs. It has a little over 6,000 rows and looks like this:
                   Date                 Visitor  V_PTS                  Home  \
25  2012-11-03 19:00:00        Sacramento Kings     98        Indiana Pacers
26  2012-11-03 19:00:00    New Orleans Pelicans     89         Chicago Bulls
27  2012-11-03 19:00:00          Boston Celtics     89    Washington Wizards
28  2012-11-03 19:00:00  Portland Trail Blazers     95       Houston Rockets
29  2012-11-03 19:30:00         Toronto Raptors    100         Brooklyn Nets
30  2012-11-03 19:30:00       Charlotte Hornets     99      Dallas Mavericks
31  2012-11-03 19:30:00   Golden State Warriors    114  Los Angeles Clippers

    H_PTS  Attendance                     Arena                 Location  \
25    106       18165   Bankers Life Fieldhouse    Indianapolis, Indiana
26     82       21758             United Center        Chicago, Illinois
27     86       20308         Capital One Arena         Washington, D.C.
28     85       18140             Toyota Center           Houston, Texas
29    107       17732           Barclays Center       Brooklyn, New York
30    126       19490  American Airlines Center            Dallas, Texas
31    110       19060            Staples Center  Los Angeles, California

    Capacity  Yr Arena Opened   Season  H_Allstars  V_Allstars  V_wins  \
25     17923             1999  2012-13           1           0       0
26     20917             1994  2012-13           2           0       1
27     20356             1997  2012-13           0           2       0
28     18055             2003  2012-13           1           1       1
29     17732             2012  2012-13           1           0       0
30     19200             2001  2012-13           0           0       1
31     19060             1999  2012-13           2           1       1

    V_losses  H_wins  H_losses  V_WPercent  H_WPercent
25         2       1         1         0.0         0.5
26         1       2         0         0.5         1.0
27         2       0         1         0.0         0.0
28         1       2         0         0.5         1.0
29         1       0         0         0.0         0.0
30         0       1         1         1.0         0.5
31         1       2         0         0.5         1.0
I'm not trying to do anything too intense, but I am trying to find what influences NBA attendance and to predict it for teams with a reasonable amount of accuracy. Are there any other predictor variables you'd recommend adding? How would you suggest going about exploring the data to gain insight, and which ML packages, if any, might be useful? This is my very first personal project, so any and all advice and examples are greatly appreciated.
Update:
After some quick exploration and browsing a few step-by-step online tutorials, I came across this correlation matrix on capacity percentage filled. I may be wrong, but do these numbers seem low? Is each team too specific, so that I may need to focus on teams that don't sell out consistently? Or is there some variable I should think about including that has to do with past attendance history?
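For readers following along, here is a minimal sketch of how the capacity percentage and a "past attendance history" variable might be computed with pandas. The column names `PctFilled`, `SoldOut`, and `PrevHomeFill` and the helper `add_attendance_features` are assumptions made for illustration, not part of the original data set; the seven rows are copied from the sample above.

```python
import pandas as pd

# The seven sample rows shown above (only the columns needed here).
games = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2012-11-03 19:00"] * 4 + ["2012-11-03 19:30"] * 3
    ),
    "Home": [
        "Indiana Pacers", "Chicago Bulls", "Washington Wizards",
        "Houston Rockets", "Brooklyn Nets", "Dallas Mavericks",
        "Los Angeles Clippers",
    ],
    "Attendance": [18165, 21758, 20308, 18140, 17732, 19490, 19060],
    "Capacity": [17923, 20917, 20356, 18055, 17732, 19200, 19060],
    "H_Allstars": [1, 2, 0, 1, 1, 0, 2],
    "V_Allstars": [0, 0, 2, 1, 0, 0, 1],
    "H_WPercent": [0.5, 1.0, 0.0, 1.0, 0.0, 0.5, 1.0],
    "V_WPercent": [0.0, 0.5, 0.0, 0.5, 0.0, 1.0, 0.5],
})


def add_attendance_features(df):
    """Add fill rate, a sell-out flag, and the home team's prior fill rate.

    All three output columns are hypothetical names chosen for this sketch.
    """
    out = df.sort_values("Date", kind="stable").copy()
    out["PctFilled"] = out["Attendance"] / out["Capacity"]
    out["SoldOut"] = out["PctFilled"] >= 1.0
    # Mean fill rate over the home team's EARLIER home games only
    # (shift(1) keeps the current game out of its own feature).
    out["PrevHomeFill"] = out.groupby("Home")["PctFilled"].transform(
        lambda s: s.shift(1).expanding().mean()
    )
    return out


feat = add_attendance_features(games)

# Correlation of each pre-game predictor with the fill rate.
cols = ["PctFilled", "H_Allstars", "V_Allstars", "H_WPercent", "V_WPercent"]
print(feat[cols].corr()["PctFilled"])
print(feat["SoldOut"].sum(), "of", len(feat), "sample games at or above capacity")
```

In the seven sample rows, six games are at or above the listed capacity. If that pattern holds across the full data set, the fill rate has very little room to vary, which may be one reason the correlations look low; checking the share of sold-out games per team would be a reasonable first step.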
A few features I would consider adding:
Home and visitor superstars. I know you have All-Stars, but superstars are a bit different. Example: LeBron James vs. DeMar DeRozan. Both are All-Stars, but many more people are likely to go see LeBron than DeMar. It would, however, be quite a pain to add all of this and also to determine who counts as a superstar.
You may want to consider rivalries. Boston vs. Los Angeles (Lakers) is always a sell-out crowd because of the history between the teams.
The number of season ticket holders may or may not affect the count. I'd be interested to see if that holds any weight.
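As a rough sketch, the superstar and rivalry ideas could be encoded as boolean flags driven by small hand-maintained lookup tables. Everything named here (`RIVALRIES`, `SUPERSTAR_TEAMS`, `add_matchup_flags`, and the output columns) is hypothetical, the lookup contents are placeholders you would have to curate yourself, and the three example rows are made up purely to exercise the function.

```python
import pandas as pd

# Hypothetical, hand-curated lookups. Extend these yourself.
# Rivalries are unordered pairs, so frozenset ignores home/visitor order.
RIVALRIES = {
    frozenset({"Boston Celtics", "Los Angeles Lakers"}),
}
# Teams carrying a "superstar" in a given season (your own judgment call).
SUPERSTAR_TEAMS = {
    "2012-13": {"Miami Heat"},  # LeBron James played for Miami that season
}


def add_matchup_flags(df, rivalries=RIVALRIES, superstar_teams=SUPERSTAR_TEAMS):
    """Add Rivalry, H_Superstar and V_Superstar boolean columns."""
    out = df.copy()
    out["Rivalry"] = [
        frozenset((home, visitor)) in rivalries
        for home, visitor in zip(out["Home"], out["Visitor"])
    ]
    out["H_Superstar"] = [
        home in superstar_teams.get(season, set())
        for home, season in zip(out["Home"], out["Season"])
    ]
    out["V_Superstar"] = [
        visitor in superstar_teams.get(season, set())
        for visitor, season in zip(out["Visitor"], out["Season"])
    ]
    return out


# Made-up matchups, only to show the flags being set.
example = pd.DataFrame({
    "Season": ["2012-13", "2012-13", "2012-13"],
    "Visitor": ["Los Angeles Lakers", "Miami Heat", "Sacramento Kings"],
    "Home": ["Boston Celtics", "Chicago Bulls", "Indiana Pacers"],
})
flagged = add_matchup_flags(example)
print(flagged[["Visitor", "Home", "Rivalry", "H_Superstar", "V_Superstar"]])
```

Keying the superstar lookup by season matters because players change teams; a per-team, per-season table is much less work to maintain than a per-game one.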
Some of these aren't too important. I would consider getting rid of the game score, since that is only known at the end of the game, while attendance is driven by ticket sales (before the game starts). It looks like you do have a decent amount of data to work with.
Obviously it's going to be a supervised model. Regression looks like what you will want to use, since attendance is a numeric target.
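To make that concrete, here is a minimal sketch using scikit-learn's `LinearRegression`, one reasonable starting package among several. It drops the final-score columns, splits chronologically, and reports mean absolute error. The feature list `PRE_GAME`, the helper `fit_attendance_model`, and the 20% hold-out are assumptions for illustration; the seven sample rows are far too few to say anything about accuracy and are here only so the code runs.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# The seven sample rows from the question (subset of columns).
games = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2012-11-03 19:00"] * 4 + ["2012-11-03 19:30"] * 3
    ),
    "V_PTS": [98, 89, 89, 95, 100, 99, 114],
    "H_PTS": [106, 82, 86, 85, 107, 126, 110],
    "Attendance": [18165, 21758, 20308, 18140, 17732, 19490, 19060],
    "Capacity": [17923, 20917, 20356, 18055, 17732, 19200, 19060],
    "H_Allstars": [1, 2, 0, 1, 1, 0, 2],
    "V_Allstars": [0, 0, 2, 1, 0, 0, 1],
    "H_WPercent": [0.5, 1.0, 0.0, 1.0, 0.0, 0.5, 1.0],
    "V_WPercent": [0.0, 0.5, 0.0, 0.5, 0.0, 1.0, 0.5],
})

# Final scores are only known after tip-off, so they cannot be predictors.
LEAKY = ["V_PTS", "H_PTS"]
PRE_GAME = ["Capacity", "H_Allstars", "V_Allstars", "H_WPercent", "V_WPercent"]


def fit_attendance_model(df, features=PRE_GAME, test_fraction=0.2):
    """Fit a linear regression on earlier games and score it on later ones."""
    df = df.sort_values("Date", kind="stable")
    n_test = max(1, int(len(df) * test_fraction))
    train, test = df.iloc[:-n_test], df.iloc[-n_test:]
    model = LinearRegression().fit(train[features], train["Attendance"])
    mae = mean_absolute_error(test["Attendance"], model.predict(test[features]))
    return model, mae


pregame = games.drop(columns=LEAKY)
model, mae = fit_attendance_model(pregame)
print("Held-out mean absolute error (seats):", mae)
```

Splitting by date rather than at random keeps later games out of the training set, which is closer to how the model would actually be used. On the full data set, a tree-based regressor would be a natural next thing to try if the linear fit looks poor.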