
How to calculate total occupancy days for each day of year, given a dataframe of start and end dates?

I have a csv file, which I can load into a list or a dataframe, containing the start and end dates of visits to a campsite.

    start_date   end_date
0   2016-01-21   2016-01-24
1   2016-01-28   2016-01-29
2   2016-02-02   2016-02-10
3   2016-02-08   2016-02-12
...

I would like to build a dataframe with one row for each day in the time period, with a column for cumulative visitors, a column for the number of visitors resident on that day, and a column for the cumulative sum of visitor days.

I currently have some hacky code that reads the visitor data into an ordinary Python list, visitor_array, and creates another list, year_array, with one entry for each date in the period/year. It then loops over each date in year_array, with an inner loop over visitor_array, and appends to the current element of year_array a count of new visitors and the number of visitors resident on that day.

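import datetime

# visitor_array: list of [start_date, end_date] pairs read from the csv
temp_day = datetime.date(2016,1,1)
# 2016 is a leap year, so the full year has 366 days
year_array = [[temp_day + datetime.timedelta(days=d)] for d in range(366)]

for day in year_array:
    new_visitors = 0
    occupancy = 0
    for visitor in visitor_array:
        # visitor arrives on this day
        if visitor[0] == day[0]:
            new_visitors += 1
        # visitor is resident on this day (start and end inclusive)
        if (visitor[0] <= day[0]) and (day[0] <= visitor[1]):
            occupancy += 1
    # list.append mutates in place and returns None, so do not reassign
    day.append(new_visitors)
    day.append(occupancy)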

I then convert year_array into a pandas dataframe, create some cumsum columns, and get on with plotting etc.
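For reference, that conversion step could look roughly like the sketch below. The three-row year_array sample and the column names (date, new_visitors, occupancy, cumul_visitors, cumul_visitor_days) are made up for illustration; the real list would hold one row per day of the year.

```python
import datetime
import pandas as pd

# Hypothetical sample of what the loop above produces:
# one [date, new_visitors, occupancy] row per day.
year_array = [
    [datetime.date(2016, 1, 21), 1, 1],
    [datetime.date(2016, 1, 22), 0, 1],
    [datetime.date(2016, 1, 23), 2, 3],
]

# Column names are illustrative, not taken from the original code.
year_df = pd.DataFrame(year_array, columns=['date', 'new_visitors', 'occupancy'])
year_df['date'] = pd.to_datetime(year_df['date'])
year_df = year_df.set_index('date')

# Running totals: visitors arrived so far, and visitor days so far.
year_df['cumul_visitors'] = year_df['new_visitors'].cumsum()
year_df['cumul_visitor_days'] = year_df['occupancy'].cumsum()
```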

Is there a more elegant pythonic/pandasic way of doing this all within pandas?

Taking df as the dataframe with the start/end values and d as the final dataframe, I would do something like this:

Code:

import numpy as np
import pandas as pd
import datetime

# ---- Create df sample
df = pd.DataFrame([['21/01/2016','24/01/2016'],
                    ['28/01/2016','29/01/2016'],
                    ['02/02/2016','10/02/2016'],
                    ['08/02/2016','12/02/2016']], columns=['start','end'] )
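# Dates are day-first (dd/mm/yyyy), so give the format explicitly;
# otherwise '08/02/2016' can be read as 2 August
df['start'] = pd.to_datetime(df['start'], format='%d/%m/%Y')
df['end'] = pd.to_datetime(df['end'], format='%d/%m/%Y')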

# ---- Create day index
# 2016 is a leap year, so the full year has 366 days
temp_day = datetime.date(2016,1,1)
index = [(temp_day + datetime.timedelta(days=d)) for d in range(366)]

# ---- Create empty result df
# initialize df, set days as datetime in index
d = pd.DataFrame(np.zeros((len(index),3)),
                 index=pd.to_datetime(index),
                 columns=['new_visitor','occupancy','occupied_day'])

# ---- Iterate over df to fill d (final df)
for i, row in df.iterrows():
    # Add 1 on the first day of each visit
    d.loc[row.start,'new_visitor'] += 1
    # Flag every day from start to end (inclusive) as occupied
    d.loc[row.start:row.end,'occupied_day'] = 1
    # Add 1 to the occupancy of every day of the visit
    d.loc[row.start:row.end,'occupancy'] += 1

# cumulative occupied days = running sum of occupied_day
d['cumul_days'] = d.occupied_day.cumsum()
# cumulative visitor days = running sum of occupancy
d['cumul_visitors'] = d.occupancy.cumsum()

An extract of the resulting output, from print(d.loc['2016-01-21':'2016-01-29']):

index         new_visitor  occupancy  occupied_day  cumul_days  cumul_visitors
2016-01-21          1.0        1.0           1.0         1.0             1.0
2016-01-22          0.0        1.0           1.0         2.0             2.0
2016-01-23          0.0        1.0           1.0         3.0             3.0
2016-01-24          0.0        1.0           1.0         4.0             4.0
2016-01-25          0.0        0.0           0.0         4.0             4.0
2016-01-26          0.0        0.0           0.0         4.0             4.0
2016-01-27          0.0        0.0           0.0         4.0             4.0
2016-01-28          1.0        1.0           1.0         5.0             5.0
2016-01-29          0.0        1.0           1.0         6.0             6.0
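
For larger datasets the row-by-row loop can be avoided. The following is a minimal sketch of one alternative, not the method used above: count arrivals and departures per day and take a running sum of their difference. It assumes end dates are inclusive and that every visit falls within 2016; the out frame and the cumul_visitor_days column name are illustrative.

```python
import pandas as pd

# Same sample visits as above, already parsed as dates
df = pd.DataFrame({
    'start': pd.to_datetime(['2016-01-21', '2016-01-28', '2016-02-02', '2016-02-08']),
    'end':   pd.to_datetime(['2016-01-24', '2016-01-29', '2016-02-10', '2016-02-12']),
})

# One entry per day of 2016 (366 days, leap year)
days = pd.date_range('2016-01-01', '2016-12-31', freq='D')

# Number of visits starting on each day
arrivals = df['start'].value_counts().reindex(days, fill_value=0)
# A departure takes effect the day after the (inclusive) end date
departures = (df['end'] + pd.Timedelta(days=1)).value_counts().reindex(days, fill_value=0)

out = pd.DataFrame({'new_visitor': arrivals})
# Visitors resident on each day = arrivals so far minus departures so far
out['occupancy'] = (arrivals - departures).cumsum()
# Running totals asked for in the question
out['cumul_visitors'] = out['new_visitor'].cumsum()
out['cumul_visitor_days'] = out['occupancy'].cumsum()

print(out.loc['2016-01-21':'2016-01-29'])
```

This does a fixed number of vectorized passes instead of one .loc assignment per visit, which may matter once there are many thousands of visits.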

Hope this code helps!
