
Python: counting rows when a date (in one data frame) falls between two other dates (in a second data frame)

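I am looking for a count of rows (represented by the "total touchpoints" column) where the touchpoint's time falls between a pair of dates in a secondary data frame, i.e. how many installs (df1) occurred between the two dates (df2). I get this error on the "SUM of Installs" portion of my code: ValueError: Can only compare identically-labeled Series objects.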

For example:

df = 'Start Date', 'End Date'; df2 = 'Event Date', 'Total Events'

Desired result = IF Event Date >= Start Date AND Event Date <= End Date, SUM (or COUNT) Total Events

Please see the code below:

import datetime
import pandas as pd


df_post_logs = pd.read_csv('logs_merged.csv',index_col=0)
df_installs = pd.read_csv('install_merge.csv',index_col=0)

'''Convert UTC to EST on Installs Add Column'''

df_installs['conversion date'] = pd.to_datetime(df_installs['conversion date'],infer_datetime_format='%Y-%m-%d')
df_installs['conversion time'] = pd.to_datetime(df_installs['conversion time'],infer_datetime_format='%H:%S:%M')

utc_datetime = df_installs['conversion time']
est_datetime = utc_datetime - datetime.timedelta(hours=5)


df_installs['utc datetime'] = utc_datetime
df_installs['est datetime'] = est_datetime

'''Add Column 10 Minutes Pre-Spot Time to Post Logs/10 Minutes Post Time to Spot'''

df_post_logs['Air Date'] = pd.to_datetime(df_post_logs['Air Date'],infer_datetime_format='%Y-%m-%d')
df_post_logs['Air Time'] = pd.to_datetime(df_post_logs['Air Time'],infer_datetime_format='%H:%S:%M')

timestamp = df_post_logs['Air Time']

df_post_logs['timestamp'] = timestamp
df_post_logs['pre spot time start'] = timestamp - datetime.timedelta(minutes=10, seconds=1)
df_post_logs['pre spot time end'] = timestamp - datetime.timedelta(seconds=1)
df_post_logs['post spot time'] = timestamp + datetime.timedelta(minutes=10)

'''SUM of Installs between pre-spot time'''

if df_installs['est datetime'] >= df_post_logs['pre spot time start'] and df_installs['est datetime'] <= df_post_logs['pre spot time end']:
    pre_spot_installs = np.count(df_post_logs['install time'])

df_post_logs['pre spot installs'] = pre_spot_installs

'''SUM of Installs between post-spot time'''

if df_installs['est datetime'] >= df_post_logs['timestamp'] and df_installs['est datetime'] <= df_post_logs['post spot time']:
    post_spot_installs = np.count(df_post_logs['install time'])

df_post_logs['post spot installs'] = post_spot_installs

'''Difference Between Post and Pre'''

if post_spot_installs - pre_spot_installs < 0:
    incremental_visits = 0
else:
    incremental_visits = post_spot_installs - pre_spot_installs

df_post_logs['incremental visits'] = incremental_visits

'''Multiply by TRP'''

lift = incremental_visits*df_post_logs['Dimension 5']
df_post_logs['lift'] = lift

'''Export to CSV'''

df_post_logs.to_csv("attribution.csv")

Use NumPy to compare the timestamps.
For example, let:

Data frame 1: df


Data frame 2: df2


Create a marker column that flags every row of df2 whose event date falls between the two timestamps:

df2['mark'] = np.where(((df['Date Start']<= df2['Event Date']) & (df2['Event Date'] <= df['Date end'])),1,0)
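
Note that this comparison is element-wise: it pairs row i of df with row i of df2, so it only works when both frames have the same length and index. If they do not, pandas raises the same "Can only compare identically-labeled Series objects" error mentioned in the question. A minimal sketch with made-up sample data (the values below are illustrative assumptions, not taken from the original post):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; both frames share the default index 0..2.
df = pd.DataFrame({
    "Date Start": pd.to_datetime(["2020-01-01", "2020-01-10", "2020-01-20"]),
    "Date end": pd.to_datetime(["2020-01-05", "2020-01-15", "2020-01-25"]),
})
df2 = pd.DataFrame({
    "Event Date": pd.to_datetime(["2020-01-03", "2020-01-09", "2020-01-22"]),
    "count": [4, 7, 2],
})

# Row-by-row test: is each event date inside the window on the same row?
in_window = (df["Date Start"] <= df2["Event Date"]) & (df2["Event Date"] <= df["Date end"])

# 1 where the event falls inside its window, 0 otherwise.
df2["mark"] = np.where(in_window, 1, 0)
```

Here the first and third events fall inside their windows, while the second (2020-01-09) falls one day before its window starts.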


Filter df2 on the marker and sum the counts:

df2[df2['mark'] == 1]['count'].sum()
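
In the question, the two frames (spots and installs) most likely have different lengths, so a row-by-row comparison does not apply directly. One possible way to count events per window, sketched here under assumptions, is to compare the underlying NumPy arrays with broadcasting, which ignores index labels. The frame and column names below (`windows`, `events`, `start`, `end`, `event_time`) are hypothetical stand-ins, not names from the original code:

```python
import numpy as np
import pandas as pd

# Hypothetical windows (one row per spot) and events (one row per install).
windows = pd.DataFrame({
    "start": pd.to_datetime(["2020-01-01 10:00", "2020-01-01 12:00"]),
    "end": pd.to_datetime(["2020-01-01 10:10", "2020-01-01 12:10"]),
})
events = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2020-01-01 10:05", "2020-01-01 10:09", "2020-01-01 11:00",
        "2020-01-01 12:10", "2020-01-01 12:30",
    ]),
})

# Plain NumPy arrays carry no index labels, so no alignment is attempted.
event_times = events["event_time"].to_numpy()   # shape (n_events,)
starts = windows["start"].to_numpy()[:, None]   # shape (n_windows, 1)
ends = windows["end"].to_numpy()[:, None]       # shape (n_windows, 1)

# Broadcasting yields an (n_windows, n_events) boolean matrix:
# inside[i, j] is True when event j lies within window i (bounds inclusive).
inside = (event_times >= starts) & (event_times <= ends)

# Count the events that fall inside each window.
windows["events_in_window"] = inside.sum(axis=1)
```

The boolean matrix holds one cell per window-event pair, so this sketch suits small to medium inputs. For very large frames a sorted lookup or a merge-based approach may be more memory-friendly.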
