Do I have enough data to do a reliable analysis?

Question

I have been building an NBA game data set covering the 2012-13 through 2016-17 seasons, with playoff games labeled separately as the 2013-2017 Playoffs. It has a little over 6,000 rows and looks like this:

                  Date                 Visitor  V_PTS                  Home  \
25 2012-11-03 19:00:00        Sacramento Kings     98        Indiana Pacers   
26 2012-11-03 19:00:00    New Orleans Pelicans     89         Chicago Bulls   
27 2012-11-03 19:00:00          Boston Celtics     89    Washington Wizards   
28 2012-11-03 19:00:00  Portland Trail Blazers     95       Houston Rockets   
29 2012-11-03 19:30:00         Toronto Raptors    100         Brooklyn Nets   
30 2012-11-03 19:30:00       Charlotte Hornets     99      Dallas Mavericks   
31 2012-11-03 19:30:00   Golden State Warriors    114  Los Angeles Clippers   

    H_PTS  Attendance                     Arena                 Location  \
25    106       18165   Bankers Life Fieldhouse    Indianapolis, Indiana   
26     82       21758             United Center        Chicago, Illinois   
27     86       20308         Capital One Arena         Washington, D.C.   
28     85       18140             Toyota Center           Houston, Texas   
29    107       17732           Barclays Center       Brooklyn, New York   
30    126       19490  American Airlines Center            Dallas, Texas   
31    110       19060            Staples Center  Los Angeles, California   

    Capacity Yr Arena Opened   Season  H_Allstars  V_Allstars  V_wins  \
25     17923            1999  2012-13           1           0       0   
26     20917            1994  2012-13           2           0       1   
27     20356            1997  2012-13           0           2       0   
28     18055            2003  2012-13           1           1       1   
29     17732            2012  2012-13           1           0       0   
30     19200            2001  2012-13           0           0       1   
31     19060            1999  2012-13           2           1       1   

    V_losses  H_wins  H_losses  V_WPercent  H_WPercent  
25         2       1         1         0.0         0.5  
26         1       2         0         0.5         1.0  
27         2       0         1         0.0         0.0  
28         1       2         0         0.5         1.0  
29         1       0         0         0.0         0.0  
30         0       1         1         1.0         0.5  
31         1       2         0         0.5         1.0

I'm not trying to do anything too intense, but I am trying to find out what influences NBA attendance and to predict it for teams with a reasonable amount of accuracy. Are there any other predictor variables you'd recommend adding? How would you suggest exploring the data to gain insight, and which ML packages, if any, might be useful? This is my very first personal project, so any and all advice or examples are greatly appreciated.

Update:

After some quick exploration and browsing some step-by-step online tutorials, I arrived at this correlation matrix on the percentage of capacity filled. I may be wrong, but do these numbers seem low? Is each team too specific, so that I need to focus on teams that don't sell out consistently? Or is there some variable I should think about including, perhaps something to do with past attendance history?
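A minimal sketch of how the capacity percentage and its correlations could be computed with pandas, using a few rows copied from the sample above. The column names `Capacity_Pct` and `Sellout` and both helper functions are assumptions introduced here for illustration; they are not part of the original data set.

```python
import pandas as pd

def add_capacity_pct(games: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the share of seats filled and a sellout flag."""
    out = games.copy()
    # Values above 1.0 mean attendance exceeded the listed capacity.
    out["Capacity_Pct"] = out["Attendance"] / out["Capacity"]
    # Assumption: treat 100% or more as a sellout.
    out["Sellout"] = out["Capacity_Pct"] >= 1.0
    return out

def capacity_correlations(games: pd.DataFrame, predictors) -> pd.Series:
    """Pearson correlation of each predictor with Capacity_Pct."""
    cols = list(predictors) + ["Capacity_Pct"]
    return games[cols].corr()["Capacity_Pct"].drop("Capacity_Pct")

# A subset of the columns from the rows shown in the question.
sample = pd.DataFrame({
    "Home": ["Indiana Pacers", "Chicago Bulls", "Washington Wizards",
             "Houston Rockets", "Brooklyn Nets", "Dallas Mavericks",
             "Los Angeles Clippers"],
    "Attendance": [18165, 21758, 20308, 18140, 17732, 19490, 19060],
    "Capacity": [17923, 20917, 20356, 18055, 17732, 19200, 19060],
    "H_Allstars": [1, 2, 0, 1, 1, 0, 2],
    "V_Allstars": [0, 0, 2, 1, 0, 0, 1],
    "H_WPercent": [0.5, 1.0, 0.0, 1.0, 0.0, 0.5, 1.0],
    "V_WPercent": [0.0, 0.5, 0.0, 0.5, 0.0, 1.0, 0.5],
})

sample = add_capacity_pct(sample)
predictors = ["H_Allstars", "V_Allstars", "H_WPercent", "V_WPercent"]
print(capacity_correlations(sample, predictors).sort_values(ascending=False))
print("Sellouts:", int(sample["Sellout"].sum()), "of", len(sample))
```

In these seven sample rows, six games are at or above listed capacity. If the full data set looks similar, the target barely varies for many teams, which would be one plausible reason for low correlations.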

Answer 1

A few features I would consider adding:

Home and visitor superstars. I know you have All-Stars, but superstars are a bit different. Example: LeBron vs. DeMar DeRozan. Both are All-Stars, but many more people are likely to go see LeBron than DeMar. It would, however, be quite a pain to add all of this and also to decide who counts as a superstar.
Rivalries. Boston vs. Los Angeles (Lakers) always draws a sellout crowd because of the history between the teams.
The number of season ticket holders. It may or may not affect the count; I'd be interested to see whether it carries any weight.
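The rivalry and superstar ideas above could be encoded as simple flags. This is only a sketch: `RIVALRIES`, `SUPERSTAR_TEAMS`, `add_matchup_features`, and the new column names are hypothetical, and the lists are hand-picked placeholders that would need to be filled in by judgment.

```python
import pandas as pd

# Hypothetical, hand-maintained lookups; who counts as a rival or a
# superstar is a judgment call, not something the data set provides.
RIVALRIES = {
    frozenset({"Boston Celtics", "Los Angeles Lakers"}),
}
SUPERSTAR_TEAMS = {
    # (season, team) pairs with at least one "superstar" on the roster
    ("2012-13", "Miami Heat"),
}

def add_matchup_features(games: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Rivalry, H_Superstar and V_Superstar flags."""
    out = games.copy()
    # A rivalry is symmetric, so compare unordered pairs of team names.
    out["Rivalry"] = [
        frozenset({home, visitor}) in RIVALRIES
        for home, visitor in zip(out["Home"], out["Visitor"])
    ]
    out["H_Superstar"] = [
        (season, home) in SUPERSTAR_TEAMS
        for season, home in zip(out["Season"], out["Home"])
    ]
    out["V_Superstar"] = [
        (season, visitor) in SUPERSTAR_TEAMS
        for season, visitor in zip(out["Season"], out["Visitor"])
    ]
    return out
```

Keeping the flags per season matters because rosters change: a team that has a superstar in one season may not have one in the next.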

Some of your current columns aren't too important. I would consider getting rid of the game score, since it is only known at the end of the game, while attendance is driven by ticket sales made before the game starts. It looks like you do have a decent amount of data to work with. Obviously it's going to be a supervised model, and since attendance is a number, regression looks like what you will want to use.
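A minimal regression sketch along those lines, assuming the column names from the question. It drops the final scores and fits ordinary least squares with NumPy; the function name `fit_attendance_ols` and the constant `POST_GAME_COLS` are made up for this example, and a dedicated ML library could be swapped in for the fitting step.

```python
import numpy as np
import pandas as pd

# Columns only known after the game ends; excluded so the model
# sees pre-game information only.
POST_GAME_COLS = ["V_PTS", "H_PTS"]

def fit_attendance_ols(games: pd.DataFrame, target: str = "Attendance",
                       features=None) -> pd.Series:
    """Fit ordinary least squares and return intercept and coefficients."""
    data = games.drop(columns=POST_GAME_COLS, errors="ignore")
    if features is None:
        # Default: every remaining numeric column except the target.
        numeric = data.select_dtypes(include="number").columns
        features = [c for c in numeric if c != target]
    X = data[features].to_numpy(dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    A = np.column_stack([np.ones(len(data)), X])
    y = data[target].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return pd.Series(coef, index=["intercept"] + list(features))
```

On the real data it would be worth holding out a season (for example, fitting on 2012-13 through 2015-16 and checking predictions on 2016-17) before trusting any coefficient.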

Answer 1, posted 2018-05-02 15:12:02