Python Pandas - replace values with NaN in multiple columns based on multiple dates?

I have a dataframe that contains observations from multiple entities over time. The index is a time series and is unique, but irregular.

A section of the dataframe looks like this:

DATE    ('ACTION', 111, 1/7/2010)   ('ACTION', 222, 1/5/2010)
1/1/2010    10                          5
1/2/2010    10                          5
1/3/2010    10                          5
1/4/2010    15                          5
1/5/2010    10                          5
1/6/2010    10                          5
1/7/2010    10                          5
1/8/2010    10                          5

Each column label is a tuple from a hierarchical index. In the tuple, value 1 is a category, value 2 is an ID and value 3 is an event date. I want to treat the day before this event date as the last valid date in the column and replace the values on or after the event date with NaN.

The new frame would look like this:

DATE    ('ACTION', 111, 1/7/2010)   ('ACTION', 222, 1/5/2010)
1/1/2010    10                          5
1/2/2010    10                          5
1/3/2010    10                          5
1/4/2010    15                          5
1/5/2010    10                          NaN
1/6/2010    10                          NaN
1/7/2010    NaN                         NaN
1/8/2010    NaN                         NaN

The dataframe could potentially contain 100000 columns. I think I understand how to replace the values in one column using a Boolean mask. I do not understand how to do this efficiently over multiple columns.

The reason for needing this is to make sure that observations are prior to an individual event that occurs at the event date. Any help would be highly appreciated.
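For reference, the single-column Boolean mask mentioned above might look roughly like the sketch below. The data and the names `col`, `event_date` and `masked` are illustrative assumptions, not taken from the real frame.

```python
import pandas as pd

# Illustrative data: one column of the example, on a daily datetime index.
idx = pd.date_range("2010-01-01", periods=8, freq="D")
col = pd.Series([5] * 8, index=idx, name=("ACTION", 222, "1/5/2010"))

# The event date is the third element of the column's tuple name.
event_date = pd.Timestamp(col.name[2])

# Keep rows strictly before the event date; all other rows become NaN.
masked = col.where(col.index < event_date)
```

Repeating this column by column is the part that does not obviously scale to 100000 columns.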

Maybe also not that fast, but already a cleaner approach based on pandas:

df.where(df.apply(lambda x: x.index < pd.Timestamp(x.name[2])))

The apply returns a dataframe with True/False values (the < expression is evaluated for each column, where x.name[2] selects the third level of that column name), and the where replaces the values at the False positions with NaN.

Full example:

In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: s = """,ACTION,ACTION
   ...: ,111,222
   ...: ,1/7/2010,1/5/2010
   ...: DATE,,
   ...: 1/1/2010,    10,                          5
   ...: 1/2/2010,    10,                          5
   ...: 1/3/2010,    10,                          5
   ...: 1/4/2010,    15,                          5
   ...: 1/5/2010,    10,                          5
   ...: 1/6/2010,    10,                          5
   ...: 1/7/2010,    10,                          5
   ...: 1/8/2010,    10,                          5"""

In [4]: df = pd.read_csv(StringIO(s), header=[0,1,2], index_col=0, parse_dates=True)

In [5]: df.where(df.apply(lambda x: x.index < pd.Timestamp(x.name[2])))
Out[5]:
              ACTION
                 111       222
            1/7/2010  1/5/2010
DATE
2010-01-01        10         5
2010-01-02        10         5
2010-01-03        10         5
2010-01-04        15         5
2010-01-05        10       NaN
2010-01-06        10       NaN
2010-01-07       NaN       NaN
2010-01-08       NaN       NaN
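For a very wide frame, the per-column lambda could probably be avoided by building the whole mask in one broadcast comparison. The following is only a sketch under the same assumptions as the example above (a datetime index, and event dates stored as strings in the third column level). It is not part of the approach shown above, and the names `event_dates`, `keep` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame: datetime index, three-level column MultiIndex.
idx = pd.date_range("2010-01-01", periods=8, freq="D", name="DATE")
cols = pd.MultiIndex.from_tuples(
    [("ACTION", 111, "1/7/2010"), ("ACTION", 222, "1/5/2010")]
)
values = np.column_stack([[10, 10, 10, 15, 10, 10, 10, 10], [5] * 8])
df = pd.DataFrame(values, index=idx, columns=cols)

# Parse the third column level (the event dates) once, for all columns.
event_dates = pd.to_datetime(df.columns.get_level_values(2)).values

# Broadcast the (n_rows, 1) row dates against the (n_cols,) event dates,
# giving an (n_rows, n_cols) mask that is True strictly before each event.
keep = df.index.values[:, np.newaxis] < event_dates

# Values where the mask is False are replaced with NaN.
result = df.where(keep)
```

The mask costs one Boolean per cell, so for 100000 columns it may be worth processing the columns in chunks.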

I am sure there may be a better way to do this, but three lines would do the job:

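In [194]:

import numpy as np

# A is True where the row date plus one day is later than the column's event date,
# i.e. where the row date is on or after the event date.
A=(np.array(pd.to_datetime(df['DATE']))[...,np.newaxis]+np.timedelta64(1,'D'))>\
   np.array([np.datetime64(pd.to_datetime(item[-1])) for item in df.columns.tolist()[1:]])
# Prepend an all-False column so the DATE column itself is never masked.
B=np.hstack((np.ones(len(df)).reshape((-1,1))!=1, A))
print(df.where(~B))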

#       DATE  (ACTION, 111, 1/7/2010)  (ACTION, 222, 1/5/2010)
#0  1/1/2010                       10                        5
#1  1/2/2010                       10                        5
#2  1/3/2010                       10                        5
#3  1/4/2010                       15                        5
#4  1/5/2010                       10                      NaN
#5  1/6/2010                       10                      NaN
#6  1/7/2010                      NaN                      NaN
#7  1/8/2010                      NaN                      NaN

#[8 rows x 3 columns]

I assume your DATE column is stored as strings and that the last item in each tuple in your column names is also stored as a string. If both are the case, you will need the conversions in the first line; otherwise you may skip some.

Edit: It runs quite slow: 100 loops, best of 3: 4.55 ms per loop.
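If the frame really is in the flat layout assumed here (a string DATE column and tuple column names), one option might be to convert it once to a datetime index with hierarchical columns, so that later masking steps need no repeated string parsing. This is a sketch on made-up data; the names `rows` and `wide` are illustrative only.

```python
import pandas as pd

# Illustrative frame in the flat layout: a string DATE column plus value
# columns whose names are (category, id, event date) tuples.
rows = [["1/%d/2010" % d, 15 if d == 4 else 10, 5] for d in range(1, 9)]
df = pd.DataFrame(
    rows,
    columns=["DATE", ("ACTION", 111, "1/7/2010"), ("ACTION", 222, "1/5/2010")],
)

# Move DATE into the index and parse it as datetimes once.
wide = df.set_index("DATE")
wide.index = pd.to_datetime(wide.index)

# Turn the tuple column names into a proper three-level MultiIndex.
wide.columns = pd.MultiIndex.from_tuples(list(wide.columns))
```

After this, `wide` has the same shape as the frame read from CSV in the full example of the other answer, so a mask built from `wide.index` and the third column level applies directly.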
