python pandas multiindex subtract rows with matching level 1 index
pandas DataFrame:
Constructor:
from datetime import date
import numpy as np
import pandas as pd

# Index: every (date, test) combination
iterables = [[date(2018,5,31),date(2018,6,26),date(2018,6,29),date(2018,7,1)],
             ['test1','test2']]
indx = pd.MultiIndex.from_product(iterables, names=['date','tests'])
col = ['tests_passing', 'tests_total']
data = np.array([[834,3476],[229,256],[1524,1738],[78,144],[1595,1738],[78,144],[1595,1738],[142,144]])
df = pd.DataFrame(data, index=indx, columns=col)
# Tests still failing on each date
df = df.assign(tests_remaining=df['tests_total'] - df['tests_passing'])
DataFrame:
                  tests_passing  tests_total  tests_remaining
date       tests
2018-05-31 test1            834         3476             2642
           test2            229          256               27
2018-06-26 test1           1524         1738              214
           test2             78          144               66
2018-06-29 test1           1595         1738              143
           test2             78          144               66
2018-07-01 test1           1595         1738              143
           test2            142          144                2
This data consists of a number of test measurements (test1, test2, etc.), each collected on some date. I want to create a new column in this DataFrame named 'progress'. In general, for each unique test (test1, for example), it should take that test's rows across all dates and subtract the 'tests_remaining' value at each date from the value at the previous date, so basically:
df.loc[(date1,test0),'progress'] = df.loc[(date0,test0),'tests_remaining']-df.loc[(date1,test0),'tests_remaining']
(with the one exception that the first date would have a progress value of 0, since it was the first collected date).
The desired output will look like this:
                 tests_passing  tests_total  tests_remaining  progress
date      tests
5/31/2018 test1            834         3476             2642         0
          test2            229          256               27         0
6/26/2018 test1           1524         1738              214      2428
          test2             78          144               66       -39
6/29/2018 test1           1595         1738              143        71
          test2             78          144               66         0
7/1/2018  test1           1595         1738              143         0
          test2            142          144                2        64
So far I have been able to use loc[] with slices to select a single test at a time and perform this calculation, which yields a pandas Series, but I am unable to do this in general across all tests without specifying the test name explicitly in the slice. This is not a reasonable solution for me, as the real data contains hundreds of tests.
All = slice(None)
df_slice = df.loc[(All,'test1'),'tests_remaining']
sub = df_slice.diff(periods=-1).shift(1).fillna(0);sub
date        tests
2018-05-31  test1       0.0
2018-06-26  test1    2428.0
2018-06-29  test1      71.0
2018-07-01  test1       0.0
Name: tests_remaining, dtype: float64
Is there a more idiomatic pandas way to create the desired column as described?
Thanks in advance for your help!
You can group by the 'tests' level and take the diff within each group (multiplying by -1 so the result is previous minus current):
df.groupby(level='tests').tests_remaining.diff().mul(-1)
Out[662]:
date        tests
2018-05-31  test1       NaN
            test2       NaN
2018-06-26  test1    2428.0
            test2     -39.0
2018-06-29  test1      71.0
            test2      -0.0
2018-07-01  test1      -0.0
            test2      64.0
Name: tests_remaining, dtype: float64
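The result above still has NaN on the first date of each test and carries -0.0 floats. Below is a minimal sketch of one way to turn it into the 'progress' column from the question. It assumes the rows are already sorted by date within each test, as they are in the question's constructor. The fillna(0) and astype(int) steps are additions for illustration and are not part of the original answer.

```python
from datetime import date

import numpy as np
import pandas as pd

# Rebuild the DataFrame from the question.
iterables = [
    [date(2018, 5, 31), date(2018, 6, 26), date(2018, 6, 29), date(2018, 7, 1)],
    ["test1", "test2"],
]
indx = pd.MultiIndex.from_product(iterables, names=["date", "tests"])
data = np.array(
    [[834, 3476], [229, 256], [1524, 1738], [78, 144],
     [1595, 1738], [78, 144], [1595, 1738], [142, 144]]
)
df = pd.DataFrame(data, index=indx, columns=["tests_passing", "tests_total"])
df = df.assign(tests_remaining=df["tests_total"] - df["tests_passing"])

# Within each test: previous date's remaining minus this date's remaining.
# Assumption: rows are in date order inside each group.
progress = df.groupby(level="tests")["tests_remaining"].diff().mul(-1)

# Added for illustration: the first date of each test has no predecessor (NaN),
# so treat it as 0, then cast back to integers (this also drops the -0.0 values).
df["progress"] = progress.fillna(0).astype(int)

print(df)
```

If the index is not guaranteed to be in date order, calling df.sort_index() before the groupby would be a reasonable precaution.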