Pandas normalize column indexed by datetimeindex by sum of groupby date

Given a dataframe that's indexed with a datetimeindex, is there an efficient way to normalize the values within a given day? For example, I'd like to sum all values for each day, and then divide each column's values by the resulting sum for that day.

I can easily group by date and calculate the divisor (the sum of each column's values for each date), but I'm not sure of the best way to divide the original dataframe by the resulting sum df.

[Image: example dataframe with datetimeindex, and the resulting df from sum]

I attempted to do something like

df / df.groupby(df.index.to_period('D')).sum()

However, it isn't behaving as I had hoped. Instead I'm getting a df with NaN everywhere and the dates appended as new index entries.

[Image: results from the above division]

Toy recreation:

import pandas as pd

df = pd.DataFrame([[1,2],[3,4],[5,6],[7,8]],columns=['a','b'], 
              index=pd.to_datetime(['2017-01-01 14:30:00','2017-01-01 14:31:00', 
                                    '2017-01-02 14:30:00', '2017-01-02 14:31:00']))
df / df.groupby(df.index.to_period('D')).sum()

results in

                       a    b
2017-01-01 14:30:00  NaN  NaN
2017-01-01 14:31:00  NaN  NaN
2017-01-02 14:30:00  NaN  NaN
2017-01-02 14:31:00  NaN  NaN
2017-01-01           NaN  NaN
2017-01-02           NaN  NaN

Please paste your dataframe as text rather than as an image so I can help further. In the meantime, here is an example.

Sample df:

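import numpy as np
import pandas as pd

# two frames of random values over the same five dates
df1 = pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'),
                  index=pd.date_range('2017-01-03', '2017-01-07'))

df2 = pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'),
                  index=pd.date_range('2017-01-03', '2017-01-07'))

# stacking them gives two rows per date
df = pd.concat([df1,df2])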

               A            B          C            D           E
2017-01-03  1.393874    1.933301    0.215026    -0.412957   -0.293925
2017-01-04  0.825777    0.315449    2.317292    -0.347617   -2.427019
2017-01-05  -0.372916   -0.931185   0.049707    0.635828    -0.774566
2017-01-06  1.564714    -1.582461   1.455403    0.521305    -2.175344
2017-01-07  1.255747    1.967338    -0.766391   -0.021921   0.672704
2017-01-03  0.620301    -1.521681   -0.352800   -1.394239   -1.206983
2017-01-04  -0.041829   -0.870871   -0.402440   0.268725    1.499321
2017-01-05  -1.098647   1.690136    1.004087    0.304037    1.235310
2017-01-06  0.305645    -0.327096   0.280591    -0.476904   1.652096
2017-01-07  1.251927    0.469697    0.047694    1.838995    -0.258889

Then, what you are currently doing:

df / df.groupby(df.index).sum()

                A           B           C          D            E
2017-01-03  0.692032    4.696817    -1.560723   0.228507    0.195831
2017-01-03  0.307968    -3.696817   2.560723    0.771493    0.804169
2017-01-04  1.053357    -0.567944   1.210167    4.406211    2.616174
2017-01-04  -0.053357   1.567944    -0.210167   -3.406211   -1.616174
2017-01-05  0.253415    -1.226937   0.047170    0.676510    -1.681122
2017-01-05  0.746585    2.226937    0.952830    0.323490    2.681122
2017-01-06  0.836585    0.828706    0.838369    11.740853   4.157386
2017-01-06  0.163415    0.171294    0.161631    -10.740853  -3.157386
2017-01-07  0.500762    0.807267    1.066362    -0.012064   1.625615
2017-01-07  0.499238    0.192733    -0.066362   1.012064    -0.625615

Take a look at the first row, column A:

1.393874 / (1.393874 + 0.620301) = 0.6920322216292031, so your example of df / df.groupby(df.index).sum() is working as expected.
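The division above works because both frames carry the same date labels, and DataFrame arithmetic aligns on index labels. A minimal sketch using the question's toy data (the names `daily_sum` and `misaligned` are only illustrative) shows what happens when the rows are indexed by timestamp but the divisor is indexed by day:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    columns=['a', 'b'],
    index=pd.to_datetime(['2017-01-01 14:30:00', '2017-01-01 14:31:00',
                          '2017-01-02 14:30:00', '2017-01-02 14:31:00']),
)

# Per-day sums: one row per day, labelled with the midnight timestamp
daily_sum = df.groupby(df.index.floor('D')).sum()

# Division aligns on the union of both indexes. No label appears in both
# frames, so every cell in the result is NaN.
misaligned = df / daily_sum
print(misaligned)
```

The result has six rows (four timestamps plus two dates), which matches the symptom described in the question.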

Also be careful if your data contains NaNs, because np.nan divided by a number is nan.
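To make the NaN caveat concrete, here is a small sketch with made-up data (the values and the names `totals` and `normalized` are assumptions for illustration). By default `sum()` skips NaN when building the group totals, but the NaN cell itself stays NaN after the division:

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2017-01-03', '2017-01-03', '2017-01-04', '2017-01-04'])
df = pd.DataFrame({'A': [1.0, np.nan, 2.0, 6.0]}, index=idx)

# GroupBy.sum() ignores NaN by default, so the 2017-01-03 total is 1.0
totals = df.groupby(df.index).sum()

# transform('sum') repeats each group's total on every row of that group,
# so the divisor has exactly the same index as df
normalized = df / df.groupby(df.index).transform('sum')
print(normalized)
```

If a missing value should count as zero instead, one option is to call `fillna(0)` on the frame before normalizing.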

Update per comment:

df = pd.DataFrame([[1,2],[3,4],[5,6],[7,8]],columns=['a','b'], 
              index=pd.to_datetime(['2017-01-01 14:30:00','2017-01-01 14:31:00', 
                                    '2017-01-02 14:30:00', '2017-01-02 14:31:00']))


# create multiindex with level 1 being just dates
df.set_index(df.index.floor('D'), inplace=True, append=True)

# divide df by the group sum matching the index values of level 1
df.div(df.groupby(level=1).sum(), level=1).reset_index(level=1, drop=True)

                          a         b
2017-01-01 14:30:00   0.250000  0.333333
2017-01-01 14:31:00   0.750000  0.666667
2017-01-02 14:30:00   0.416667  0.428571
2017-01-02 14:31:00   0.583333  0.571429
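As an alternative sketch, not part of the answer above: `GroupBy.transform('sum')` returns the per-day totals already broadcast to the original index, so the division needs no MultiIndex. It reuses the toy data; `day_totals` and `normalized` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    columns=['a', 'b'],
    index=pd.to_datetime(['2017-01-01 14:30:00', '2017-01-01 14:31:00',
                          '2017-01-02 14:30:00', '2017-01-02 14:31:00']),
)

# Group rows by calendar day; transform keeps the original timestamp index
# and fills each row with its day's column total
day_totals = df.groupby(df.index.floor('D')).transform('sum')

# Indexes match exactly, so this is a plain element-wise division
normalized = df / day_totals
print(normalized)
```

This leaves `df` itself unmodified, whereas the `set_index(..., inplace=True)` approach changes its index.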
