Applying a function to a MultiIndex DataFrame with pandas / Python

Question

I have the following DataFrame that I wish to apply some date range calculations to. I want to select rows in the DataFrame where the date difference between samples for unique persons (from sample_date) is less than 8 weeks, and keep the row with the oldest date (i.e. the first sample).

Here is an example dataset. The actual dataset can exceed 200,000 records.

labno   name    sex dob         id     location  sample_date
1       John A  M   12/07/1969  12345  A         12/05/2112
2       John B  M   10/01/1964  54321  B         6/12/2010
3       James   M   30/08/1958  87878  A         30/04/2012
4       James   M   30/08/1958  45454  B         29/04/2012
5       Peter   M   12/05/1935  33322  C         15/07/2011
6       John A  M   12/07/1969  12345  A         14/05/2012
7       Peter   M   12/05/1935  33322  A         23/03/2011
8       Jack    M   5/12/1921   65655  B         15/08/2011
9       Jill    F   6/08/1986   65459  A         16/02/2012
10      Julie   F   4/03/1992   41211  C         15/09/2011
11      Angela  F   1/10/1977   12345  A         23/10/2006
12      Mark A  M   1/06/1955   56465  C         4/04/2011
13      Mark A  M   1/06/1955   45456  C         3/04/2011
14      Mark B  M   9/12/1984   55544  A         13/09/2012
15      Mark B  M   9/12/1984   55544  A         1/01/2012

Unique persons are those with the same name and dob. For example, John A, James, Mark A, and Mark B are unique persons. Mark A, however, has different id values.

I normally use R for this procedure: I generate a list of dataframes based on the name/dob combination and sort each dataframe by sample_date. I then use a list apply function to check the difference in date between the first and last index within each dataframe, and return the oldest row if it is less than 8 weeks from the most recent date. It takes forever.

I would welcome a few pointers as to how I might attempt this with python/pandas. I started by making a MultiIndex with name/dob/id. The structure looks like what I want. What I need to do is try applying some of the functions I use in R to select out the rows I need. I have tried selecting with df.xs() but I am not getting very far.

Here is a dictionary of the data that can be loaded easily into pandas (albeit with a different column order).
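For readers unfamiliar with the setup described above, here is a minimal sketch of building a name/dob/id MultiIndex and taking a cross-section with xs(). The three rows are a hand-copied subset of the example data, and the variable names (indexed, john) are illustrative only; the question does not show the exact calls that were tried.

```python
import pandas as pd

# Hand-copied subset of the example data (three rows only).
df = pd.DataFrame({
    "labno": [1, 6, 3],
    "name": ["John A", "John A", "James"],
    "dob": ["12/07/1969", "12/07/1969", "30/08/1958"],
    "id": [12345, 12345, 87878],
    "sample_date": ["12/05/2112", "14/05/2012", "30/04/2012"],
})

# Build the name/dob/id MultiIndex described in the question.
indexed = df.set_index(["name", "dob", "id"]).sort_index()

# xs() returns a cross-section: every row whose "name" level equals "John A".
# The selected level is dropped from the result's index by default.
john = indexed.xs("John A", level="name")
print(john)
```

xs() is handy for pulling out one person at a time, but it selects by a fixed key, so it does not by itself express a per-person date comparison; a groupby over name and dob is usually a closer fit for that.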

{'dob': {0: '12/07/1969', 1: '10/01/1964', 2: '30/08/1958', 3: '30/08/1958', 4: '12/05/1935', 5: '12/07/1969', 6: '12/05/1935', 7: '5/12/1921', 8: '6/08/1986', 9: '4/03/1992', 10: '1/10/1977', 11: '1/06/1955', 12: '1/06/1955', 13: '9/12/1984', 14: '9/12/1984'},
 'id': {0: 12345, 1: 54321, 2: 87878, 3: 45454, 4: 33322, 5: 12345, 6: 33322, 7: 65655, 8: 65459, 9: 41211, 10: 12345, 11: 56465, 12: 45456, 13: 55544, 14: 55544},
 'labno': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15},
 'location': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'C', 5: 'A', 6: 'A', 7: 'B', 8: 'A', 9: 'C', 10: 'A', 11: 'C', 12: 'C', 13: 'A', 14: 'A'},
 'name': {0: 'John A', 1: 'John B', 2: 'James', 3: 'James', 4: 'Peter', 5: 'John A', 6: 'Peter', 7: 'Jack', 8: 'Jill', 9: 'Julie', 10: 'Angela', 11: 'Mark A', 12: 'Mark A', 13: 'Mark B', 14: 'Mark B'},
 'sample_date': {0: '12/05/2112', 1: '6/12/2010', 2: '30/04/2012', 3: '29/04/2012', 4: '15/07/2011', 5: '14/05/2012', 6: '23/03/2011', 7: '15/08/2011', 8: '16/02/2012', 9: '15/09/2011', 10: '23/10/2006', 11: '4/04/2011', 12: '3/04/2011', 13: '13/09/2012', 14: '1/01/2012'},
 'sex': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'M', 5: 'M', 6: 'M', 7: 'M', 8: 'F', 9: 'F', 10: 'F', 11: 'M', 12: 'M', 13: 'M', 14: 'M'}}
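As a rough illustration of loading a dictionary in this shape, the sketch below uses an abbreviated copy (first three records only) and parses the date columns. It assumes the dates are day/month/year, which the values such as 30/08/1958 suggest; the variable name data is illustrative.

```python
import pandas as pd

# Abbreviated copy of the dictionary (first three records only).
data = {
    "dob": {0: "12/07/1969", 1: "10/01/1964", 2: "30/08/1958"},
    "id": {0: 12345, 1: 54321, 2: 87878},
    "labno": {0: 1, 1: 2, 2: 3},
    "location": {0: "A", 1: "B", 2: "A"},
    "name": {0: "John A", 1: "John B", 2: "James"},
    "sample_date": {0: "12/05/2112", 1: "6/12/2010", 2: "30/04/2012"},
    "sex": {0: "M", 1: "M", 2: "M"},
}

# Outer keys become columns, inner keys become the row index.
df = pd.DataFrame(data)

# The dates arrive as strings; parse them explicitly as day/month/year
# so that date arithmetic such as diff() works later on.
for col in ("dob", "sample_date"):
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

print(df.dtypes)
```

Giving an explicit format avoids any ambiguity between day-first and month-first parsing for values like 6/12/2010.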

Answer 1

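I think what you might be looking for is:

import numpy as np
import pandas as pd

# sample_date must be a datetime column (day/month/year here) for diff() to work
df['sample_date'] = pd.to_datetime(df['sample_date'], dayfirst=True)

def differ(df):
    # gap between consecutive samples within one person's group
    delta = df.sample_date.diff().abs()  # only care about magnitude
    cond = delta.notnull() & (delta < np.timedelta64(8, 'W'))
    return df[cond].max()

# sort first so that diff() compares chronologically adjacent samples
delta = df.sort_values('sample_date').groupby(['dob', 'name']).apply(differ)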

Depending on whether or not you want to keep people who don't have more than 1 sample, you can call delta.dropna(how='all') to remove them.

Note that I think you'll need numpy >= 1.7 for the timedelta64 comparison to work correctly, as there are a whole host of problems with timedelta64 / datetime64 for numpy < 1.7.
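The question also describes a slightly different rule: compare the first and last sample for each person and keep the earliest row when they fall within 8 weeks. The following is one possible sketch of that reading, not the answer's method. It uses a hand-copied subset of the example rows, and the names span, count, close, and first are illustrative.

```python
import pandas as pd

# Hand-copied subset of the example data: four persons, two samples each.
df = pd.DataFrame({
    "labno": [3, 4, 5, 7, 12, 13, 14, 15],
    "name": ["James", "James", "Peter", "Peter",
             "Mark A", "Mark A", "Mark B", "Mark B"],
    "dob": ["30/08/1958", "30/08/1958", "12/05/1935", "12/05/1935",
            "1/06/1955", "1/06/1955", "9/12/1984", "9/12/1984"],
    "sample_date": ["30/04/2012", "29/04/2012", "15/07/2011", "23/03/2011",
                    "4/04/2011", "3/04/2011", "13/09/2012", "1/01/2012"],
})
df["sample_date"] = pd.to_datetime(df["sample_date"], format="%d/%m/%Y")

# Sort chronologically so the first row of each person is the earliest sample.
df = df.sort_values("sample_date")

# Per-person span between earliest and latest sample, broadcast to every row.
grouped = df.groupby(["name", "dob"])["sample_date"]
span = grouped.transform("max") - grouped.transform("min")
count = grouped.transform("count")

# Keep persons with more than one sample whose samples span under 8 weeks.
close = (count > 1) & (span < pd.Timedelta(weeks=8))

# From those, keep only the earliest row per person.
first = df[close].drop_duplicates(subset=["name", "dob"], keep="first")
print(first)
```

On this subset, James and Mark A have samples one day apart, so their earliest rows are kept, while Peter and Mark B have samples months apart and are excluded. Because the work is done with transform rather than a Python function per group, this style tends to scale better to a few hundred thousand rows.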
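On more recent pandas versions, the same 8-week threshold can also be written with pd.Timedelta, which avoids depending on how numpy's week unit is handled. This is a small sketch with made-up dates, not part of the original answer; the names dates and eight_weeks are illustrative.

```python
import numpy as np
import pandas as pd

# Check which numpy version is installed (the note above concerns < 1.7).
print(np.__version__)

# Three made-up sample dates for one person, already in chronological order.
dates = pd.Series(pd.to_datetime(["2012-04-29", "2012-04-30", "2012-09-13"]))

# Gap to the previous sample; the first entry has no predecessor (NaT).
delta = dates.diff().abs()

# 8 weeks expressed as a pandas Timedelta (56 days).
eight_weeks = pd.Timedelta(weeks=8)

# True only where a previous sample exists and is less than 8 weeks earlier.
cond = delta.notna() & (delta < eight_weeks)
print(cond.tolist())
```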

Answer 1: score 6, accepted, 2013-08-10 03:49:39