Do I have enough data to run reliable analysis?

I have an NBA game data set that I've been building, covering the 2012-13 through 2016-17 seasons, with playoff games labeled separately as 2013-2017 Playoffs. It has a little over 6,000 rows and looks like this:

                  Date                 Visitor  V_PTS                  Home  \
25 2012-11-03 19:00:00        Sacramento Kings     98        Indiana Pacers   
26 2012-11-03 19:00:00    New Orleans Pelicans     89         Chicago Bulls   
27 2012-11-03 19:00:00          Boston Celtics     89    Washington Wizards   
28 2012-11-03 19:00:00  Portland Trail Blazers     95       Houston Rockets   
29 2012-11-03 19:30:00         Toronto Raptors    100         Brooklyn Nets   
30 2012-11-03 19:30:00       Charlotte Hornets     99      Dallas Mavericks   
31 2012-11-03 19:30:00   Golden State Warriors    114  Los Angeles Clippers   

    H_PTS  Attendance                     Arena                 Location  \
25    106       18165   Bankers Life Fieldhouse    Indianapolis, Indiana   
26     82       21758             United Center        Chicago, Illinois   
27     86       20308         Capital One Arena         Washington, D.C.   
28     85       18140             Toyota Center           Houston, Texas   
29    107       17732           Barclays Center       Brooklyn, New York   
30    126       19490  American Airlines Center            Dallas, Texas   
31    110       19060            Staples Center  Los Angeles, California   

    Capacity Yr Arena Opened   Season  H_Allstars  V_Allstars  V_wins  \
25     17923            1999  2012-13           1           0       0   
26     20917            1994  2012-13           2           0       1   
27     20356            1997  2012-13           0           2       0   
28     18055            2003  2012-13           1           1       1   
29     17732            2012  2012-13           1           0       0   
30     19200            2001  2012-13           0           0       1   
31     19060            1999  2012-13           2           1       1   

    V_losses  H_wins  H_losses  V_WPercent  H_WPercent  
25         2       1         1         0.0         0.5  
26         1       2         0         0.5         1.0  
27         2       0         1         0.0         0.0  
28         1       2         0         0.5         1.0  
29         1       0         0         0.0         0.0  
30         0       1         1         1.0         0.5  
31         1       2         0         0.5         1.0 

I'm not trying to do anything too intense, but I am trying to find what influences NBA attendance and to predict it for each team with a reasonable amount of accuracy. Are there any other predictor variables you'd recommend adding? How would you suggest exploring the data to gain insight, and which ML packages, if any, might be useful? This is my very first personal project, so any and all advice/examples are greatly appreciated.

Update:

After some quick exploration and a few step-by-step online tutorials, I produced this correlation matrix for the percentage of capacity filled. I may be wrong, but do these numbers seem low? Is each team too specific, so that I need to focus on teams that don't sell out consistently? Or is there some variable I should think about including, perhaps something to do with past attendance history?

[Image: correlation matrix for capacity percentage]
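For anyone who wants to reproduce this kind of check, here is a minimal sketch. It uses only the seven sample rows printed above, so the numbers themselves mean nothing; `Capacity_Pct` and `Sellout` are column names I made up for illustration, not columns from the original data set.

```python
import pandas as pd

# The seven sample rows shown in the question (only the columns needed here).
games = pd.DataFrame({
    "Home": ["Indiana Pacers", "Chicago Bulls", "Washington Wizards",
             "Houston Rockets", "Brooklyn Nets", "Dallas Mavericks",
             "Los Angeles Clippers"],
    "Attendance": [18165, 21758, 20308, 18140, 17732, 19490, 19060],
    "Capacity": [17923, 20917, 20356, 18055, 17732, 19200, 19060],
    "H_Allstars": [1, 2, 0, 1, 1, 0, 2],
    "V_Allstars": [0, 0, 2, 1, 0, 0, 1],
    "H_WPercent": [0.5, 1.0, 0.0, 1.0, 0.0, 0.5, 1.0],
    "V_WPercent": [0.0, 0.5, 0.0, 0.5, 0.0, 1.0, 0.5],
})

# Hypothetical derived columns: share of capacity filled, and a sellout flag.
games["Capacity_Pct"] = games["Attendance"] / games["Capacity"]
games["Sellout"] = games["Capacity_Pct"] >= 1.0

# Correlation of each pre-game predictor with the share of capacity filled.
predictors = ["H_Allstars", "V_Allstars", "H_WPercent", "V_WPercent"]
corr_with_target = (
    games[predictors + ["Capacity_Pct"]].corr()["Capacity_Pct"].drop("Capacity_Pct")
)

# How often each home team sells out. Teams that are always at or above
# capacity leave very little variation for any predictor to explain.
sellout_rate = games.groupby("Home")["Sellout"].mean()

print(corr_with_target)
print(sellout_rate)
```

If most home teams show a sellout rate near 1.0 on the full data, that alone could explain low correlations, since the target barely moves for those teams.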

A few features I would consider adding:

  1. Home and visitor superstars. I know you have All-Stars, but superstars are a bit different. Example: LeBron vs. DeMar DeRozan. Both are All-Stars, but many more people are likely to go see LeBron than DeMar. It would, however, be quite a pain to add all of this and to decide who counts as a superstar.

  2. You may want to consider rivalries. Boston vs. Los Angeles (Lakers) is always a sellout crowd because of the history between the teams.

  3. The number of season ticket holders may or may not affect the count. I'd be interested to see if that holds any weight.
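As a rough sketch of how a rivalry flag and a past-attendance feature could be added with pandas: the `RIVALRIES` set, the `add_features` helper, and the new column names are all hypothetical, and the three rows of toy data are invented for illustration rather than taken from real games.

```python
import pandas as pd

# Hypothetical, hand-maintained set of rivalry pairings (order does not matter).
RIVALRIES = {frozenset({"Boston Celtics", "Los Angeles Lakers"})}


def add_features(games: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a rivalry flag and the home team's previous attendance."""
    out = games.sort_values("Date").copy()
    # True when the home/visitor pairing is in the rivalry set.
    out["Is_Rivalry"] = [
        frozenset(pair) in RIVALRIES for pair in zip(out["Home"], out["Visitor"])
    ]
    # Attendance at the same team's previous home game (NaN for its first one).
    out["Prev_Home_Attendance"] = out.groupby("Home")["Attendance"].shift(1)
    return out


# Toy rows with made-up attendance figures, only to show the mechanics.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2013-01-01", "2013-01-03", "2013-01-05"]),
    "Home": ["Boston Celtics", "Boston Celtics", "Los Angeles Lakers"],
    "Visitor": ["Los Angeles Lakers", "Toronto Raptors", "Boston Celtics"],
    "Attendance": [18624, 17500, 18997],
})

featured = add_features(toy)
print(featured[["Home", "Visitor", "Is_Rivalry", "Prev_Home_Attendance"]])
```

The same pattern would work for a superstar flag if you keep a hand-picked list of players, though deciding who goes on that list is the hard part.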

Some of these aren't too important. I would consider getting rid of the game score, since that is only known at the end of the game, while attendance is driven by ticket sales (before the game starts). It looks like you do have a decent amount of data to work with. Obviously it's going to be a supervised model, and regression looks like what you will want to use.
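To make that concrete, here is a minimal sketch of dropping the score columns and fitting a plain least-squares regression. I'm using `numpy.linalg.lstsq` as a stand-in for whatever regression package you end up choosing, and the choice of features and of capacity share as the target are my own assumptions. With only the seven sample rows from the question, the fit shows the mechanics and nothing more.

```python
import numpy as np
import pandas as pd

# The seven sample rows from the question (subset of columns).
games = pd.DataFrame({
    "V_PTS": [98, 89, 89, 95, 100, 99, 114],
    "H_PTS": [106, 82, 86, 85, 107, 126, 110],
    "Attendance": [18165, 21758, 20308, 18140, 17732, 19490, 19060],
    "Capacity": [17923, 20917, 20356, 18055, 17732, 19200, 19060],
    "H_Allstars": [1, 2, 0, 1, 1, 0, 2],
    "V_Allstars": [0, 0, 2, 1, 0, 0, 1],
    "H_WPercent": [0.5, 1.0, 0.0, 1.0, 0.0, 0.5, 1.0],
})

# Final scores are only known after the game, so they cannot be predictors.
pregame = games.drop(columns=["V_PTS", "H_PTS"])

# Assumed target: share of capacity filled.
y = (pregame["Attendance"] / pregame["Capacity"]).to_numpy()

# Assumed feature set, plus a leading column of ones for the intercept.
features = ["H_Allstars", "V_Allstars", "H_WPercent"]
X = np.column_stack(
    [np.ones(len(pregame)), pregame[features].to_numpy(dtype=float)]
)

# Ordinary least squares: coef[0] is the intercept, the rest follow `features`.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coef
rmse = float(np.sqrt(np.mean((y - predicted) ** 2)))

print(dict(zip(["intercept"] + features, coef)))
print("in-sample RMSE:", rmse)
```

On the full 6,000-row set you would want to hold out a season (or a random split) and report the error on that, since in-sample error always looks better than it should.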

