I used to have a SQL query that counted the number of records for a given day at a given location.
The input data structure was: id, location, start_date, end_date
import pandas as pd
data = [('20170009003','0681','2017-07-25','2017-08-02'),
('20170009221','0682','2017-07-28','2017-08-02'),
('20170009271','0682','2017-07-31','2017-08-02'),
('20170009286','0681','2017-07-18','2017-09-19'),
('20170009654','0682','2017-07-28','2017-08-03'),
('20170010053','0681','2017-07-31','2017-08-04'),
('20170010059','0681','2017-07-20','2017-08-07')]
labels = ['idnum','loc','start_date','end_date']
df = pd.DataFrame.from_records(data, columns=labels)
The query would give me the count of persons present on a given day. For example, for '2017-08-01' I would get:
2017-08-01, 0681, 4
2017-08-01, 0682, 3
I'd like to produce a similar result with python/pandas.
If it's of any help, the SQL (a PostgreSQL function) used to achieve the above goal was:
CREATE OR REPLACE FUNCTION nb_present(oneday date)
 RETURNS TABLE(ddj date, loc character, eff numeric)
 LANGUAGE sql
AS $function$
SELECT $1, loc,
       sum(case when ($1 = start_date and start_date = end_date) then 1
                when $1 = start_date then 0.5
                when $1 = end_date then 0.5
                when ($1 > start_date and $1 < end_date) then 1
                else 0 end)
from passage group by 1, 2 order by 1, 2;
$function$
Thanks for your help.
PS: This is my first post here.
I believe this is what you're looking for (make sure your start_date and end_date columns are pandas datetime objects):
dt = pd.to_datetime('2017-08-01')
# Keep rows whose stay strictly contains dt, then count rows per location
df1 = df[(df['start_date'] < dt) & (df['end_date'] > dt)].groupby('loc').size().to_frame()
df1['Date'] = dt
IIUC:
target = '2017-08-01'
df[(df['start_date'] < target) & (df['end_date'] > target)].groupby(['loc']).size()
Output:
loc
0681 4
0682 3
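Note that the strict comparisons above skip records that start or end on the target date, whereas the original SQL weights those days as 0.5. Below is a minimal sketch of that weighting in pandas. `nb_present` is a hypothetical helper named after the SQL function, and it assumes the first CASE branch is meant to match stays that start and end on the same day.

```python
import numpy as np
import pandas as pd

data = [('20170009003', '0681', '2017-07-25', '2017-08-02'),
        ('20170009221', '0682', '2017-07-28', '2017-08-02'),
        ('20170009271', '0682', '2017-07-31', '2017-08-02'),
        ('20170009286', '0681', '2017-07-18', '2017-09-19'),
        ('20170009654', '0682', '2017-07-28', '2017-08-03'),
        ('20170010053', '0681', '2017-07-31', '2017-08-04'),
        ('20170010059', '0681', '2017-07-20', '2017-08-07')]
df = pd.DataFrame.from_records(data, columns=['idnum', 'loc', 'start_date', 'end_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])


def nb_present(df, day):
    """Hypothetical helper: weighted head count per location for one day."""
    day = pd.to_datetime(day)
    start, end = df['start_date'], df['end_date']
    # Conditions are checked in order, like the branches of the SQL CASE
    conditions = [
        (day == start) & (start == end),  # assumed: same-day stay counts as 1
        (day == start) | (day == end),    # arrival or departure day counts as 0.5
        (day > start) & (day < end),      # day strictly inside the stay counts as 1
    ]
    weights = np.select(conditions, [1.0, 0.5, 1.0], default=0.0)
    return pd.Series(weights, index=df.index).groupby(df['loc']).sum()


result = nb_present(df, '2017-08-02')
print(result)
```

On '2017-08-01' no stay starts or ends, so this sketch should agree with the plain counts above; the half weights only matter on boundary days such as '2017-08-02'.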
Here's one solution if you want to do this frequently for several dates: we create another DataFrame that checks whether each date falls between a row's start and end dates (using an IntervalIndex, though that is not necessary). We can then group that DataFrame by the loc variable in the other DataFrame (grouping is aligned on index, so we use .reset_index to ensure everything is aligned with our newly created DataFrame) and just take a sum, since we have True or False values.
import pandas as pd
import numpy as np
df['start_date'] = pd.to_datetime(df.start_date)
df['end_date'] = pd.to_datetime(df.end_date)
df.index = pd.IntervalIndex.from_arrays(df.start_date, df.end_date, closed='both')
# Dates you care about
dates = pd.to_datetime(['2017-08-01', '2017-08-02', '2017-08-03'])
df_bet = pd.DataFrame(np.reshape([d in ids for d in dates for ids in df.index] ,(-1, len(df))), index=dates).T
df_bet.groupby(df.reset_index()['loc']).sum()
      2017-08-01  2017-08-02  2017-08-03
loc
0681         4.0         4.0         3.0
0682         3.0         3.0         1.0
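As noted, the IntervalIndex is not strictly necessary. The same table can probably be built with NumPy broadcasting instead of a Python-level loop. This is only a sketch: the variable names `present` and `counts` are my own, and the comparison is inclusive on both ends to match closed='both'.

```python
import pandas as pd

data = [('20170009003', '0681', '2017-07-25', '2017-08-02'),
        ('20170009221', '0682', '2017-07-28', '2017-08-02'),
        ('20170009271', '0682', '2017-07-31', '2017-08-02'),
        ('20170009286', '0681', '2017-07-18', '2017-09-19'),
        ('20170009654', '0682', '2017-07-28', '2017-08-03'),
        ('20170010053', '0681', '2017-07-31', '2017-08-04'),
        ('20170010059', '0681', '2017-07-20', '2017-08-07')]
df = pd.DataFrame.from_records(data, columns=['idnum', 'loc', 'start_date', 'end_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

dates = pd.to_datetime(['2017-08-01', '2017-08-02', '2017-08-03'])

# Column vectors of shape (n_rows, 1) against a row vector of shape (1, n_dates)
start = df['start_date'].to_numpy()[:, None]
end = df['end_date'].to_numpy()[:, None]
days = dates.to_numpy()[None, :]

# Boolean matrix: True where the date lies within the stay, endpoints included
present = (start <= days) & (days <= end)

# One column per date; cast to int so the per-location sums are integers
counts = (pd.DataFrame(present.astype(int), index=df.index, columns=dates)
            .groupby(df['loc'])
            .sum())
print(counts)
```

The comparison creates an n_rows by n_dates matrix in memory, so this trades memory for speed on large inputs.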
With your help, I came up with:
import pandas as pd
data = [('20170009003','0681','2017-07-25','2017-08-02'),
('20170009221','0682','2017-07-28','2017-08-02'),
('20170009271','0682','2017-07-31','2017-08-02'),
('20170009286','0681','2017-07-18','2017-09-19'),
('20170009654','0682','2017-07-28','2017-08-03'),
('20170010053','0681','2017-07-31','2017-08-04'),
('20170010059','0681','2017-07-20','2017-08-07')]
labels = ['idnum','loc','start_date','end_date']
df = pd.DataFrame.from_records(data, columns=labels)
df['end_date'] = pd.to_datetime(df['end_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
dt = pd.to_datetime('2017-08-01')
df1 = df[(df['start_date'] < dt) & (df['end_date'] > dt)].groupby('loc').size().to_frame()
df1['Date'] = dt
Which works fine.
Now I have to tweak it to count the number of people present on each day between two dates. I'll keep that as homework.
Thanks a lot
This is also possible using just Python: sort the records on two keys (end date and location), then apply itertools.groupby on the same two keys and count each group:
from itertools import groupby
from operator import itemgetter
data = sorted(data, key=itemgetter(-1, 1))
for k, g in groupby(data, key=itemgetter(-1, 1)):
    print('{}, {}, {}'.format(k[0], k[1], len(list(g))))
2017-08-02, 0681, 1
2017-08-02, 0682, 2
2017-08-03, 0682, 1
2017-08-04, 0681, 1
2017-08-07, 0681, 1
2017-09-19, 0681, 1
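The grouping above counts records per (end_date, loc) pair rather than the people present on a particular day. A pure-Python sketch of the presence count for a single day could look like the following. It uses collections.Counter, and the `target` variable and strict comparison are assumptions carried over from the pandas answers.

```python
from collections import Counter

data = [('20170009003', '0681', '2017-07-25', '2017-08-02'),
        ('20170009221', '0682', '2017-07-28', '2017-08-02'),
        ('20170009271', '0682', '2017-07-31', '2017-08-02'),
        ('20170009286', '0681', '2017-07-18', '2017-09-19'),
        ('20170009654', '0682', '2017-07-28', '2017-08-03'),
        ('20170010053', '0681', '2017-07-31', '2017-08-04'),
        ('20170010059', '0681', '2017-07-20', '2017-08-07')]

target = '2017-08-01'

# ISO-formatted date strings sort chronologically, so plain string
# comparison is enough to test whether target lies inside a stay.
counts = Counter(loc for _, loc, start, end in data if start < target < end)

for loc, n in sorted(counts.items()):
    print('{}, {}, {}'.format(target, loc, n))
# prints:
# 2017-08-01, 0681, 4
# 2017-08-01, 0682, 3
```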
I finally came up with a slightly different solution. As I needed to merge the resulting dataframe with another one, here is what I did:
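frames = []
for dt in pd.date_range('2017-08-01', '2017-08-05'):
    # Count, per location, the stays that strictly contain this day
    df1 = df[(df['start_date'] < dt) & (df['end_date'] > dt)].groupby('loc').size().to_frame().reset_index()
    df1['Date'] = dt
    frames.append(df1)
# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
df0 = pd.concat(frames, ignore_index=True)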
Best regards
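For reference, the per-day loop can likely be replaced by a cross join, which yields the same long format (one row per date and location) that is convenient for merging. This is a sketch that assumes pandas 1.2 or later, where merge accepts how='cross'. The names `days`, `pairs` and the `count` column are my own.

```python
import pandas as pd

data = [('20170009003', '0681', '2017-07-25', '2017-08-02'),
        ('20170009221', '0682', '2017-07-28', '2017-08-02'),
        ('20170009271', '0682', '2017-07-31', '2017-08-02'),
        ('20170009286', '0681', '2017-07-18', '2017-09-19'),
        ('20170009654', '0682', '2017-07-28', '2017-08-03'),
        ('20170010053', '0681', '2017-07-31', '2017-08-04'),
        ('20170010059', '0681', '2017-07-20', '2017-08-07')]
df = pd.DataFrame.from_records(data, columns=['idnum', 'loc', 'start_date', 'end_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

# One row per day of interest
days = pd.DataFrame({'Date': pd.date_range('2017-08-01', '2017-08-05')})

# Pair every record with every day (assumes pandas >= 1.2 for how='cross')
pairs = df.merge(days, how='cross')

# Keep the pairs where the day lies strictly inside the stay, as in the loop version
inside = pairs[(pairs['start_date'] < pairs['Date']) & (pairs['end_date'] > pairs['Date'])]

# Count the remaining records per day and location
result = inside.groupby(['Date', 'loc']).size().reset_index(name='count')
print(result)
```

As with the loop, a date and location pair with nobody present simply does not appear in the result. The cross join grows as records times days, so the loop may still be the better choice for long date ranges.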