
Pandas averaging selected rows and columns

I am working with some EPL stats. I have a csv with all matches from one season in the following format.

D           H                A           H_SC  A_SC H_ODDS  D_ODDS  A_ODDS...
11.05.2014  Norwich          Arsenal     0     2    5.00    4.00    1.73 
11.05.2014  Chelsea          Swansea     0     0    1.50    3.00    5.00     

What I would like to do is, for each match, calculate the average stats of both teams over their N previous matches. The result should look something like this.

D           H        A        H_SC         A_SC         H_ODDS  D_ODDS  A_ODDS...
11.05.2014  Norwich  Arsenal  avgNorwichSC avgArsenalSC 5.00    4.00    1.73
11.05.2014  Chelsea  Swansea  avgChelseaSC avgSwanseaSC 1.50    3.00    5.00 

So the date, teams and odds remain untouched, and the other stats are replaced with the average from the N previous matches. EDIT: The matches from the first N rounds should not be in the final table, because there is not enough data to calculate averages.

The trickiest part for me is that the stats I am averaging have a different prefix (H_ or A_) depending on where the match was played.

All I have managed to do so far is create a dictionary where the key is the club name and the value is a DataFrame containing all matches played by that club.

D           H        A          H_SC  A_SC  H_ODDS  D_ODDS  A_ODDS...
11.05.2014  Norwich  Arsenal    0     2     5.00    4.00    1.73
04.05.2014  Arsenal  West Brom  1     0     1.40    5.25    8.00 

I have also previously coded this without pandas, but I was not satisfied with the code and I would like to learn pandas :).

You say you want to learn pandas, so I've given a few examples (tested with similar data) to get you going along the right track. It's a bit of an opinion, but I think finding the last N games is hard, so I'll initially assume you want averages over the whole table. If finding "last N" is really important, I can add to the answer. This should get you going with pandas and groupby - I've left prints in so you can understand what's going on.

import pandas

# DataFrame.from_csv has been removed from pandas; read_csv is the replacement
EPL_df = pandas.read_csv('D:\\EPLstats.csv')
# Parse the date column so that dates compare chronologically
EPL_df['D'] = pandas.to_datetime(EPL_df['D'])
homeGroup = EPL_df.groupby('H')
awayGroup = EPL_df.groupby('A')

# Most recent match date per team: home games, away games, then both combined
homeLastGame = homeGroup['D'].max()
awayLastGame = awayGroup['D'].max()
teamLastGame = pandas.concat([homeLastGame, awayLastGame]).groupby(level=0).max()
print(teamLastGame)

# Average goals scored per team: at home, away, and over all matches
homeAveScore = homeGroup['H_SC'].mean()
awayAveScore = awayGroup['A_SC'].mean()
teamAveScore = (homeGroup['H_SC'].sum() + awayGroup['A_SC'].sum()) / (homeGroup['H_SC'].count() + awayGroup['A_SC'].count())

print(teamAveScore)

You now have average scores for each team along with their most recent match dates. All you have to do now is select the relevant rows of the original dataframe using the most recent dates (i.e. everything apart from the score columns) and then select from the average score dataframes using the team names from that row.

e.g.

recentRows = EPL_df.loc[EPL_df['D'] > pandas.to_datetime("2015/01/10")]

print(recentRows)

def insertAverages(s):
    # Look up the overall average score of the home and away team in this row
    a = teamAveScore[s['H']]
    b = teamAveScore[s['A']]
    print(a, b)
    return pandas.Series(dict(H_AVSC=a, A_AVSC=b))

finalTable = pandas.concat([recentRows, recentRows.apply(insertAverages, axis=1)], axis=1)

print(finalTable)

finalTable has your original odds etc. for the most recent games, with two extra columns (H_AVSC and A_AVSC) for the average scores of the home and away teams involved in those matches.
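
As a possible alternative to the row-wise apply, the same lookup can be sketched with Series.map, which looks up each team name in the averages Series. The small teamAveScore and recentRows below are made-up stand-ins so the snippet runs on its own; the numbers are illustrative and not real EPL averages.

```python
import pandas as pd

# Made-up stand-ins for the teamAveScore Series and recentRows frame above
teamAveScore = pd.Series({"Norwich": 0.8, "Arsenal": 1.9, "Chelsea": 2.0, "Swansea": 1.2})
recentRows = pd.DataFrame({
    "H": ["Norwich", "Chelsea"],
    "A": ["Arsenal", "Swansea"],
    "H_ODDS": [5.00, 1.50],
})

# map() looks up each team name in the index of teamAveScore
finalTable = recentRows.assign(
    H_AVSC=recentRows["H"].map(teamAveScore),
    A_AVSC=recentRows["A"].map(teamAveScore),
)
print(finalTable)
```

A team name missing from teamAveScore would come out as NaN here rather than raising a KeyError, which may or may not be what you want.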

Edit

Couple of gotchas

  1. Just noticed I didn't put a format string in to_datetime(). Your dates look like UK format (day first) with dots, so you should do

    EPL_df['D'] = pandas.to_datetime(EPL_df['D'], format='%d.%m.%Y')

  2. You could use the minimum of the dates in teamLastGame instead of the hard-coded 2015/01/10 in my example.

  3. If you really need to replace column H_SC with H_AVSC in your finalTable, rather than add on the averages:

    newCols = recentRows.apply(insertAverages, axis=1)
    recentRows = recentRows.copy()  # work on a copy, not a slice of EPL_df
    recentRows['H_SC'] = newCols['H_AVSC']
    recentRows['A_SC'] = newCols['A_AVSC']
    print(recentRows)
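
For the "N previous matches" requirement in the question, one possible approach (a rough sketch, not tested against your real file) is to reshape the table so each team gets one row per match, which makes the H_/A_ prefix problem go away, then take a shifted rolling mean per team. The sample rows, the names long, side, match and avg_SC, and N = 2 are all made up for illustration.

```python
import pandas as pd

# Made-up sample in the question's column layout (odds columns omitted)
matches = pd.DataFrame({
    "D": ["01.05.2014", "04.05.2014", "08.05.2014", "11.05.2014"],
    "H": ["Arsenal", "Norwich", "Arsenal", "Norwich"],
    "A": ["Norwich", "Arsenal", "Norwich", "Arsenal"],
    "H_SC": [1, 0, 3, 0],
    "A_SC": [0, 2, 1, 2],
})
matches["D"] = pd.to_datetime(matches["D"], format="%d.%m.%Y")
matches = matches.sort_values("D").reset_index(drop=True)

N = 2  # number of previous matches to average over (assumed value)

# Long format: one row per (match, team), so H_SC and A_SC share one column
home = matches[["H", "H_SC"]].rename(columns={"H": "team", "H_SC": "SC"})
home = home.assign(side="H", match=matches.index)
away = matches[["A", "A_SC"]].rename(columns={"A": "team", "A_SC": "SC"})
away = away.assign(side="A", match=matches.index)
long = pd.concat([home, away], ignore_index=True).sort_values("match")

# shift(1) excludes the current match; rolling(N) needs N earlier matches
long["avg_SC"] = long.groupby("team")["SC"].transform(
    lambda s: s.shift(1).rolling(N).mean()
)

# Put the averages back next to the original rows, aligned on match number
home_avg = long[long["side"] == "H"].set_index("match")["avg_SC"]
away_avg = long[long["side"] == "A"].set_index("match")["avg_SC"]
result = matches.assign(H_SC=home_avg, A_SC=away_avg)

# Matches where either team has fewer than N earlier games are dropped
result = result.dropna(subset=["H_SC", "A_SC"])
print(result)
```

The same pattern should extend to other stat columns by carrying them through the reshape, and the dropna step is what keeps the first rounds out of the final table.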
