
Iterate over unique values in pandas

I have a dataset in the following format:

Patient  Date       colA  colB
1        1/3/2015   .     5
1        2/5/2015   3     10
1        3/5/2016   8     .
2        4/5/2014   2     .
2        etc

I am trying to define a function in pandas that treats each unique patient as an item and iterates over these patients to keep only the most recent observation per column (replacing all other values with missing or null). For example, for patient 1 the output would be:

Patient  Date       colA  colB
1        1/3/2015   .     .
1        2/5/2015   .     10
1        3/5/2016   8     .

I understand that I can use something like the following with .apply(), but this does not account for duplicate patient IDs...

def getrecentobs():
    for i in df['Patient']:
        etc

Any help or direction is much appreciated.

I think you can use to_numeric to convert the . values to NaN, then create a mask with groupby and rank, and finally apply the mask:

print(df)
   Patient      Date colA colB
0        1  1/3/2015    .    5
1        1  2/5/2015    3   10
2        1  3/5/2016    8    .
3        2  4/5/2014    2    .
4        2  5/5/2014    4    .

df['colA'] = pd.to_numeric(df['colA'], errors='coerce')
df['colB'] = pd.to_numeric(df['colB'], errors='coerce')
print(df)
   Patient      Date  colA  colB
0        1  1/3/2015   NaN     5
1        1  2/5/2015     3    10
2        1  3/5/2016     8   NaN
3        2  4/5/2014     2   NaN
4        2  5/5/2014     4   NaN
print(df.groupby('Patient')[['colA','colB']].rank(method='max', ascending=False))
   colA  colB
0   NaN     2
1     2     1
2     1   NaN
3     2   NaN
4     1   NaN

mask = df.groupby('Patient')[['colA','colB']].rank(method='max', ascending=False) == 1
print(mask)
    colA   colB
0  False  False
1  False   True
2   True  False
3  False  False
4   True  False

df[['colA','colB']] = df[['colA','colB']][mask]
print(df)
   Patient      Date  colA  colB
0        1  1/3/2015   NaN   NaN
1        1  2/5/2015   NaN    10
2        1  3/5/2016     8   NaN
3        2  4/5/2014   NaN   NaN
4        2  5/5/2014     4   NaN
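
One caveat on the mask above: rank orders the values themselves, so it keeps the largest value per patient, which happens to coincide with the most recent one in this sample. A minimal sketch that selects by Date instead, assuming the dates are month/day/year strings and reusing the sample frame from the question (the names value_cols, dates_with_value and latest are only illustrative):

```python
import pandas as pd

# Sample data from the question; "." marks a missing value.
df = pd.DataFrame({
    "Patient": [1, 1, 1, 2, 2],
    "Date": ["1/3/2015", "2/5/2015", "3/5/2016", "4/5/2014", "5/5/2014"],
    "colA": [".", "3", "8", "2", "4"],
    "colB": ["5", "10", ".", ".", "."],
})

value_cols = ["colA", "colB"]
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")
# Assumption: dates are month/day/year.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

for col in value_cols:
    # Keep the date only on rows where this column has a value.
    dates_with_value = df["Date"].where(df[col].notna())
    # Latest such date per patient, broadcast back to every row.
    latest = dates_with_value.groupby(df["Patient"]).transform("max")
    # Blank everything except the value recorded on that latest date.
    df[col] = df[col].where(dates_with_value == latest)

print(df)
```

If a patient has two rows with the same latest date, this sketch keeps both values.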

pandas has a method called last that can be used with groupby to give you the last values in each group. I'm not sure why you need the blank rows, but if you do, you can join the groupby result back onto the original data frame. The sort is there because the dates were not sorted in my sample data. Hope that helps.

Example:

DataFrame

     id        date     amount  code
  0  3107  2010-10-20   136.4004   290
  1  3001  2010-10-08   104.1800   290
  2  3109  2010-10-08   276.0629   165
  3  3001  2010-10-08  -177.9800   290
  4  3002  2010-10-08  1871.1094   290
  5  3109  2010-10-08   225.7038   155
  6  3109  2010-10-08    98.5578   170
  7  3107  2010-10-08   231.3949   165
  8  3203  2010-10-08   333.6636   290
  9 -9100  2010-10-08  3478.7500   290

If the previous rows are not needed:

  b.sort_values("date").groupby(["id","date"]).last().reset_index()

The groupby aggregates the data with "last", meaning it takes the last value of each column within each group.

Output with only the latest rows:

   id        date     amount  code
0 -9100  2010-10-08  3478.7500   290
1  3001  2010-10-08  -177.9800   290 
2  3002  2010-10-08  1871.1094   290
3  3107  2010-10-08   231.3949   165
4  3107  2010-10-20   136.4004   290
5  3109  2010-10-08    98.5578   170
6  3203  2010-10-08   333.6636   290
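
Applied to the patient data from the question, the same idea might look like the sketch below. It relies on groupby(...).last() skipping missing values by default, so each column gets its own most recent non-missing value. It assumes the . placeholders are converted to NaN first and that the dates are month/day/year; value_cols and latest are illustrative names.

```python
import pandas as pd

# Sample data from the question; "." marks a missing value.
df = pd.DataFrame({
    "Patient": [1, 1, 1, 2, 2],
    "Date": ["1/3/2015", "2/5/2015", "3/5/2016", "4/5/2014", "5/5/2014"],
    "colA": [".", "3", "8", "2", "4"],
    "colB": ["5", "10", ".", ".", "."],
})

value_cols = ["colA", "colB"]
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")
# Assumption: dates are month/day/year.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# One row per patient: the last non-missing value of each column by date.
latest = df.sort_values("Date").groupby("Patient")[value_cols].last()
print(latest)
```

This gives one summary row per patient rather than the blanked-out layout asked for in the question.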

I think you are looking for pandas groupby.

For example, df.groupby('Patient').last() will return a DataFrame with the last observation of each patient. If the patients are not sorted by date, you can find the latest record date using the max function.

df.groupby('Patient').last()
             Date colA colB
Patient                    
1        3/5/2016    8    .
2             etc    2    .

You can make your own functions and then call the apply() function of groupby.
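
A rough sketch of such a custom function, assuming the . placeholders are converted to NaN and the dates are month/day/year. The helper keep_most_recent and the names value_cols and cleaned are hypothetical, not part of pandas:

```python
import pandas as pd

# Sample data from the question; "." marks a missing value.
df = pd.DataFrame({
    "Patient": [1, 1, 1, 2, 2],
    "Date": ["1/3/2015", "2/5/2015", "3/5/2016", "4/5/2014", "5/5/2014"],
    "colA": [".", "3", "8", "2", "4"],
    "colB": ["5", "10", ".", ".", "."],
})

value_cols = ["colA", "colB"]
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")
# Assumption: dates are month/day/year.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")


def keep_most_recent(group):
    """Hypothetical helper: per column, keep only the latest non-missing value."""
    group = group.sort_values("Date")
    for col in value_cols:
        last_idx = group[col].last_valid_index()  # None if the column is all NaN
        keep = pd.Series(False, index=group.index)
        if last_idx is not None:
            keep.loc[last_idx] = True
        group[col] = group[col].where(keep)
    return group


# Run the helper once per patient, then restore the original row order.
cleaned = (
    df.groupby("Patient", group_keys=False)[["Date"] + value_cols]
      .apply(keep_most_recent)
      .sort_index()
)
df[value_cols] = cleaned[value_cols]
print(df)
```

Selecting the columns explicitly before apply() keeps the grouping column out of the helper's input.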
