
Perform lookup on dataframe based on value of a different column

I have a dataframe like this:

df = {'Request': [0, 0, 1, 0, 1, 0, 0],
 'Time': ['16:00', '17:00', '18:00', '19:00', '20:00', '20:30', '24:00'],
 'grant': [3, 0, 0, 5, 0, 0, 5]}

pd.DataFrame(df).set_index('Time')

    Out[16]: 
       Request  grant
Time                 
16:00        0      3
17:00        0      0
18:00        1      0
19:00        0      5
20:00        1      0
20:30        0      0
24:00        0      5

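The values in the 'Request' column are boolean and indicate whether a request was made: 1 = request, 0 = no request. The values in the 'grant' column denote the initial grant size.

I want to calculate the time between the request and the grant for each request. In this case they would be 19:00 - 18:00 = 1 hour and 24:00 - 20:00 = 4 hours. Is there a way to do this easily on a large data set using pandas?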

I would go about it like this:

import pandas as pd

df = {'Request': [0, 0, 1, 0, 1, 0, 0],
     'Time': ['16:00', '17:00', '18:00', '19:00', '20:00', '20:30', '24:00'],
     'grant': [3, 0, 0, 5, 0, 0, 5]}

df = pd.DataFrame(df) #create DataFrame

#get rid of any rows that have neither a grant nor a request
#(.copy() avoids a SettingWithCopyWarning on the assignments below)
df = df[(df[['grant', 'Request']] != 0).any(axis=1)].copy()

#change the time in HH:MM to number of minutes
df['Time'] = df['Time'].str.split(":").apply(lambda x: int(x[0])*60 + int(x[1]))

#get the difference between consecutive remaining rows
df['timeElapsed'] = df['Time'].diff()

#filter out the requests to keep only the grants and their times.
#Also, drop the NA from the first line.
df = df[df['grant'] != 0].dropna()

#drop all columns except timeElapsed and grant
df = df[['timeElapsed', 'grant']]

The output then looks like this, with timeElapsed in minutes:

   timeElapsed  grant
3         60.0      5
6        240.0      5
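
If hours or a proper duration type are preferred over a float count of minutes, the result can be converted afterwards. This is a minimal sketch that rebuilds the output above by hand; the column names `elapsed` and `hours` are illustrative and not part of the original answer.

```python
import pandas as pd

# Output of the approach above, reconstructed by hand for illustration
result = pd.DataFrame({'timeElapsed': [60.0, 240.0], 'grant': [5, 5]},
                      index=[3, 6])

# Minutes as a timedelta column (hypothetical column name)
result['elapsed'] = pd.to_timedelta(result['timeElapsed'], unit='min')

# Minutes as decimal hours (hypothetical column name)
result['hours'] = result['timeElapsed'] / 60

print(result)
```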

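You need to convert your Time column to datetime in order to get the difference, but you first need to change 24:00 so that the conversion does not raise an error. You can do that with mask + pd.to_datetime. Then select the rows starting from the first request == 1 (df2) and use groupby to create one group per request, numbered in order of appearance. Finally, compute the difference from groupby.first and groupby.last.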

#transform Time column to get the diff
#(assign the result back; an inplace mask on df['Time'] may not update df)
df['Time']=df['Time'].mask(df['Time'].eq('24:00'),'00:00')
df['Time']=pd.to_datetime(df['Time'],format='%H:%M')

#select rows from first request==1
mask=df.Request.eq(1).cumsum()>0
df2=df[mask]

#create a Series to group by
groups=df2['Request'].eq(1).cumsum()

#get the difference by group
g=df2.groupby(groups)['Time']
diff=(g.last()-g.first()).dt.seconds/3600

print(diff)

Request
1    1.0
2    4.0
Name: Time, dtype: float64
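
To make the grouping step easier to follow, here is a small sketch of what the cumulative sum produces for the example `Request` column. The variable name `started` is illustrative; the logic mirrors `mask` and `groups` above.

```python
import pandas as pd

request = pd.Series([0, 0, 1, 0, 1, 0, 0], name='Request')

# True from the first request onward, False before it
started = request.eq(1).cumsum() > 0

# Each request starts a new group; the rows after it share its label
groups = request[started].eq(1).cumsum()

print(groups.tolist())  # [1, 1, 2, 2, 2]
```

Each group therefore begins at a request row, so last minus first within a group measures the time from that request to the last row before the next request.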

If you want to create a new column, you can use transform:

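#transform Time column to get the diff
#(assign the result back; an inplace mask on df['Time'] may not update df)
df['Time']=df['Time'].mask(df['Time'].eq('24:00'),'00:00')
df['Time']=pd.to_datetime(df['Time'],format='%H:%M')
df['Time']=df['Time'].dt.hour #keeps only the hour, minutes are dropped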

#select rows from first request==1
mask=df.Request.eq(1).cumsum()>0 #True from the first request onward
df2=df[mask]

#create a Series to group by
groups=df2['Request'].eq(1).cumsum() #one group label per request

#get the difference (last - first per group) and save it in a new column
g=df2.groupby(groups)['Time']
df.loc[mask,'difference']=g.transform(lambda x: x.iloc[-1]-x.iloc[0])
df['difference']=df['difference'].mask(df['difference']<0,df['difference']+24)
print(df)

   Request  Time  grant  difference
0        0    16      3         NaN
1        0    17      0         NaN
2        1    18      0         1.0
3        0    19      5         1.0
4        1    20      0         4.0
5        0    20      0         4.0
6        0     0      5         4.0
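
Note that `.dt.hour` keeps only the hour, so 20:30 shows up as 20 in the table above. A possible variation, sketched here under the assumption that decimal hours are acceptable, converts the strings with `pd.to_timedelta` instead. That function accepts '24:00:00', so neither the 24:00 replacement nor the negative-difference correction is needed. The variable name `hours` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Request': [0, 0, 1, 0, 1, 0, 0],
                   'Time': ['16:00', '17:00', '18:00', '19:00',
                            '20:00', '20:30', '24:00'],
                   'grant': [3, 0, 0, 5, 0, 0, 5]})

# Decimal hours since midnight; '24:00:00' parses as one full day
hours = pd.to_timedelta(df['Time'] + ':00') / pd.Timedelta(hours=1)

# Same masking and grouping as above
mask = df['Request'].eq(1).cumsum() > 0
groups = df.loc[mask, 'Request'].eq(1).cumsum()

# Last minus first time within each request group
df.loc[mask, 'difference'] = (
    hours[mask].groupby(groups).transform(lambda x: x.iloc[-1] - x.iloc[0])
)

print(df)
```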

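You first need to convert your Time index into something that can be subtracted, so that you can find the time delta. Using pd.to_datetime doesn't work because there is no 24:00. The solution below uses decimal time (1:30PM = 13.5) and assumes Time is the index, as in the question: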

# Convert the index into decimal time
df.index = pd.to_timedelta(df.index + ':00') / pd.Timedelta(hours=1)

# Get time when each request was made
r = df[df['Request'] != 0].index.to_series()

# Get time where each grant was made
g = df[df['grant'] != 0].index.to_series()

# `asof` means "get the last available value in `r` as of each time in `g.index`"
tmp = r.asof(g)
df['Delta'] = tmp.index - tmp

Result:

      Request  grant  Delta
Time                       
16.0        0      3    NaN
17.0        0      0    NaN
18.0        1      0    NaN
19.0        0      5    1.0
20.0        1      0    NaN
20.5        0      0    NaN
24.0        0      5    4.0
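
To see the `asof` lookup in isolation, here is a minimal sketch using only the request and grant times from the example. The names `requests`, `grant_times`, `last_request` and `delta` are illustrative and not from the answer above.

```python
import pandas as pd

# Request times in decimal hours, indexed by themselves (like `r` above)
requests = pd.Series([18.0, 20.0], index=[18.0, 20.0])

# Times at which grants happened
grant_times = [16.0, 19.0, 24.0]

# For each grant time, the most recent request at or before it
last_request = requests.asof(grant_times)

# Waiting time per grant; NaN where no request preceded the grant
delta = last_request.index.to_series() - last_request

print(delta)
```

The 16:00 grant has no earlier request, so its lookup is NaN, which matches the NaN in the first row of the result table.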
