Apply function to a MultiIndex dataframe with pandas/python
I have the following DataFrame that I wish to apply some date range calculations to. I want to select rows in the data frame where the date difference between samples for unique persons (from sample_date) is less than 8 weeks, and keep the row with the oldest date (i.e. the first sample).
Here is an example dataset. The actual dataset can exceed 200,000 records.
labno  name    sex  dob         id     location  sample_date
1      John A  M    12/07/1969  12345  A         12/05/2112
2      John B  M    10/01/1964  54321  B         6/12/2010
3      James   M    30/08/1958  87878  A         30/04/2012
4      James   M    30/08/1958  45454  B         29/04/2012
5      Peter   M    12/05/1935  33322  C         15/07/2011
6      John A  M    12/07/1969  12345  A         14/05/2012
7      Peter   M    12/05/1935  33322  A         23/03/2011
8      Jack    M    5/12/1921   65655  B         15/08/2011
9      Jill    F    6/08/1986   65459  A         16/02/2012
10     Julie   F    4/03/1992   41211  C         15/09/2011
11     Angela  F    1/10/1977   12345  A         23/10/2006
12     Mark A  M    1/06/1955   56465  C         4/04/2011
13     Mark A  M    1/06/1955   45456  C         3/04/2011
14     Mark B  M    9/12/1984   55544  A         13/09/2012
15     Mark B  M    9/12/1984   55544  A         1/01/2012
Unique persons are those with the same name and dob. For example John A, James, Mark A, and Mark B are unique persons. Mark A however has different id values.
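As a rough illustration of that definition, unique persons can be found by grouping on name and dob together. The sketch below uses a small hand-copied subset of the example rows; the variable names `people`, `id_counts` and `multi_id` are hypothetical and not part of the original post.

```python
import pandas as pd

# Small subset of the example rows (not the full dataset).
people = pd.DataFrame({
    "name": ["John A", "John A", "Mark A", "Mark A", "James", "James"],
    "dob": ["12/07/1969", "12/07/1969", "1/06/1955", "1/06/1955",
            "30/08/1958", "30/08/1958"],
    "id": [12345, 12345, 56465, 45456, 87878, 45454],
})

# One group per unique person: same name and same dob.
id_counts = people.groupby(["name", "dob"])["id"].nunique()

# Persons whose records carry more than one id value.
multi_id = id_counts[id_counts > 1]
print(multi_id)
```

Grouping on the name/dob pair rather than on id keeps a person's records together even when the id differs between rows.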
I normally use R for this procedure: I generate a list of data frames based on the name/dob combination and sort each data frame by sample_date. I then use a list apply function to check the date difference between the first and last index within each data frame, and return the oldest row if it is less than 8 weeks from the most recent date. It takes forever.
I would welcome a few pointers as to how I might attempt this with python/pandas. I started by making a MultiIndex with name/dob/id. The structure looks like what I want. What I need to do is try applying some of the functions I use in R to select out the rows I need. I have tried selecting with df.xs() but I am not getting very far.
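For reference, a minimal sketch of the setup described here: building the name/dob/id MultiIndex with `set_index` and taking a cross-section with `xs`. The frame is a hand-copied subset of the example rows, and the names `indexed` and `mark_a` are hypothetical.

```python
import pandas as pd

# Small subset of the example rows (not the full dataset).
df = pd.DataFrame({
    "labno": [1, 6, 12, 13],
    "name": ["John A", "John A", "Mark A", "Mark A"],
    "dob": ["12/07/1969", "12/07/1969", "1/06/1955", "1/06/1955"],
    "id": [12345, 12345, 56465, 45456],
    "sample_date": ["12/05/2112", "14/05/2012", "4/04/2011", "3/04/2011"],
})

# Build the name/dob/id MultiIndex described in the question.
indexed = df.set_index(["name", "dob", "id"]).sort_index()

# Cross-section: every row for one name, whatever the dob and id.
mark_a = indexed.xs("Mark A", level="name")
print(mark_a)
```

`xs` only selects a cross-section of the index; per-person calculations are usually expressed with `groupby` instead.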
Here is a dictionary of the data that can be loaded easily into pandas (albeit with different column order).
{'dob': {0: '12/07/1969', 1: '10/01/1964', 2: '30/08/1958', 3: '30/08/1958', 4: '12/05/1935', 5: '12/07/1969', 6: '12/05/1935', 7: '5/12/1921', 8: '6/08/1986', 9: '4/03/1992', 10: '1/10/1977', 11: '1/06/1955', 12: '1/06/1955', 13: '9/12/1984', 14: '9/12/1984'},
 'id': {0: 12345, 1: 54321, 2: 87878, 3: 45454, 4: 33322, 5: 12345, 6: 33322, 7: 65655, 8: 65459, 9: 41211, 10: 12345, 11: 56465, 12: 45456, 13: 55544, 14: 55544},
 'labno': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15},
 'location': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'C', 5: 'A', 6: 'A', 7: 'B', 8: 'A', 9: 'C', 10: 'A', 11: 'C', 12: 'C', 13: 'A', 14: 'A'},
 'name': {0: 'John A', 1: 'John B', 2: 'James', 3: 'James', 4: 'Peter', 5: 'John A', 6: 'Peter', 7: 'Jack', 8: 'Jill', 9: 'Julie', 10: 'Angela', 11: 'Mark A', 12: 'Mark A', 13: 'Mark B', 14: 'Mark B'},
 'sample_date': {0: '12/05/2112', 1: '6/12/2010', 2: '30/04/2012', 3: '29/04/2012', 4: '15/07/2011', 5: '14/05/2012', 6: '23/03/2011', 7: '15/08/2011', 8: '16/02/2012', 9: '15/09/2011', 10: '23/10/2006', 11: '4/04/2011', 12: '3/04/2011', 13: '13/09/2012', 14: '1/01/2012'},
 'sex': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'M', 5: 'M', 6: 'M', 7: 'M', 8: 'F', 9: 'F', 10: 'F', 11: 'M', 12: 'M', 13: 'M', 14: 'M'}}
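A dictionary of this shape can be passed straight to `pd.DataFrame`. The sketch below uses an abbreviated copy (first three records, three columns) and parses the date strings, assuming they are all day/month/year, which values such as 30/08/1958 suggest but the post does not state outright.

```python
import pandas as pd

# Abbreviated copy of the dictionary above (first three records only).
data = {
    "dob": {0: "12/07/1969", 1: "10/01/1964", 2: "30/08/1958"},
    "name": {0: "John A", 1: "John B", 2: "James"},
    "sample_date": {0: "12/05/2112", 1: "6/12/2010", 2: "30/04/2012"},
}

df = pd.DataFrame(data)

# Assumption: every date string is day/month/year.
for col in ["dob", "sample_date"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

print(df.dtypes)
```

Date arithmetic such as `diff()` needs datetime columns, so this conversion has to happen before any 8-week comparison.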
I think what you might be looking for is
import numpy as np

def differ(df):
    delta = df.sample_date.diff().abs()  # only care about magnitude
    # rows whose gap to the previous sample is under 8 weeks
    cond = delta.notnull() & (delta < np.timedelta64(8, 'W'))
    return df[cond].max()

delta = df.groupby(['dob', 'name']).apply(differ)
Depending on whether or not you want to keep people who don't have more than 1 sample, you can call delta.dropna(how='all') to remove them.
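To show what `how='all'` does, here is a small sketch on a made-up result frame. The values and the labels `person_1` to `person_3` are purely hypothetical and are not output from the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical per-person result: one complete row, one partly
# missing row, and one row where every value is missing.
result = pd.DataFrame(
    {"labno": [4.0, 13.0, np.nan], "location": ["B", np.nan, np.nan]},
    index=["person_1", "person_2", "person_3"],
)

# how="all" drops a row only when every column in it is missing.
cleaned = result.dropna(how="all")
print(cleaned)
```

The partly missing row survives; only the row with no values at all is removed.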
Note that I think you'll need numpy >= 1.7 for the timedelta64 comparison to work correctly, as there are a whole host of problems with timedelta64 / datetime64 for numpy < 1.7.
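The 8-week threshold can also be written with pandas' own `pd.Timedelta`. The sketch below is one possible reading of the question, following the first-to-last comparison described for the R version; it is not the method from the answer above. It runs on a hand-copied subset of the example rows, and the names `first_samples`, `n_samples` and `keep` are hypothetical.

```python
import pandas as pd

# Subset of the example rows; dates are assumed to be day/month/year.
df = pd.DataFrame({
    "labno": [3, 4, 5, 7, 8, 12, 13],
    "name": ["James", "James", "Peter", "Peter", "Jack", "Mark A", "Mark A"],
    "dob": ["30/08/1958", "30/08/1958", "12/05/1935", "12/05/1935",
            "5/12/1921", "1/06/1955", "1/06/1955"],
    "sample_date": pd.to_datetime(
        ["30/04/2012", "29/04/2012", "15/07/2011", "23/03/2011",
         "15/08/2011", "4/04/2011", "3/04/2011"],
        format="%d/%m/%Y",
    ),
})

# Per-person first date, last date and sample count, aligned to each row.
dates = df.groupby(["name", "dob"])["sample_date"]
first = dates.transform("min")
last = dates.transform("max")
n_samples = dates.transform("count")

# Keep the earliest row of each person who has more than one sample
# and whose first and last samples are less than 8 weeks apart.
keep = (
    (n_samples > 1)
    & (last - first < pd.Timedelta(weeks=8))
    & (df["sample_date"] == first)
)
first_samples = df[keep]
print(first_samples)
```

On this subset the rows kept are labno 4 (James) and labno 13 (Mark A). Because `transform` returns values aligned to the original rows, the selection is a single boolean mask and avoids a Python-level function call per group, which may help on a dataset of 200,000 records.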