
Sum weekly totals of values from one data frame based on dates in another data frame in Python

I want to sum the values in one column of a dataframe for certain dates that are defined by another dataframe.

My first dataframe of dates looks like this:

import numpy as np
import pandas as pd

start_date = ["2-22-16 00:00:00", "2-29-16 00:00:00", "3-7-16 00:00:00", "3-14-16 00:00:00", "3-21-16 00:00:00", "3-28-16 00:00:00", "4-4-16 00:00:00", "4-11-16 00:00:00", "4-18-16 00:00:00", "4-25-16 00:00:00", "5-2-16 00:00:00", "5-9-16 00:00:00", "5-16-16 00:00:00", "5-23-16 00:00:00", "5-30-16 00:00:00", "6-6-16 00:00:00", "6-13-16 00:00:00", "6-20-16 00:00:00", "6-27-16 00:00:00", "7-4-16 00:00:00", "7-11-16 00:00:00", "7-18-16 00:00:00", "7-25-16 00:00:00", "8-08-16 00:00:00", "8-22-16 00:00:00", "8-29-16 00:00:00", "9-5-16 00:00:00", "9-12-16 00:00:00", "9-19-16 00:00:00", "9-26-16 00:00:00", "10-3-16 00:00:00", "10-10-16 00:00:00", "10-17-16 00:00:00", "10-24-16 00:00:00", "10-31-16 00:00:00", "11-7-16 00:00:00", "11-14-16 00:00:00", "11-21-16 00:00:00", "1-23-17 00:00:00", "1-30-17 00:00:00", "2-06-17 00:00:00", "3-13-17 00:00:00", "3-27-17 00:00:00", "6-19-17 00:00:00", "6-26-17 00:00:00"]
start_date = [pd.to_datetime(d) for d in start_date]
end_date = pd.DatetimeIndex(start_date) + pd.DateOffset(7)
ndf = pd.DataFrame({'start':pd.to_datetime(start_date),'end':end_date}); ndf.head()

What I want is values from another data frame that fall within the weeks defined in ndf. My other dataframe looks something like this:

dates = ["4-17-16 04:00:00", "4-16-16 19:30:00", "4-16-16 19:00:00", "2-24-16 09:00:00", "3-21-16 02:00:00", "3-18-16 10:00:00", "3-24-16 05:00:00", "3-11-16 00:00:00"]
df = pd.DataFrame(
    {'timestamp': dates,
     'value': np.random.randint(1,25,size=(8,))})

Now I want to create a new data frame that sums all the values from df that fall between the dates in ndf. So I created this function:

def get_dates(x):
    # Select the df values between start and ending datetime. 
    n = df[(df['timestamp']>ndf['start'])&(df['timestamp']<ndf['end'])]
    # Return sum of values
    return n.values[0],n['value'].sum()

I also played around with this: n = df[(df['timestamp']>ndf['start'])&(df['timestamp']<ndf['end'])]. But I get the error: ValueError: Can only compare identically-labeled Series objects.

I'm looking for someone to help me clean up my function so that it works or provide insight on the error message above. Thanks!

For your specific case, where the start and end dates form one continuous time period, you probably want to use something like this:

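def get_dates():
    # Parse the timestamp strings so they can be compared with ndf's datetimes.
    ts = pd.to_datetime(df['timestamp'])
    # Select the df rows between the earliest start and the latest end.
    n = df[(ts > ndf['start'].min()) & (ts < ndf['end'].max())]
    # Return the first matching row and the sum of the values.
    return n.values[0], n['value'].sum()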

Your error says that you are trying to compare Series of different lengths (and therefore with different index labels). Your ndf has 45 rows, while df has a different number of rows (8 in the sample above).
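To make the error concrete, here is a minimal sketch with two made-up Series (the names `a` and `b` are only for illustration). Comparing them element-wise fails because their indexes do not match, while comparing against a single scalar such as `b.min()` works:

```python
import pandas as pd

a = pd.Series([1, 2, 3])   # index 0, 1, 2
b = pd.Series([1, 2])      # index 0, 1

try:
    a > b                  # the indexes differ, so pandas refuses to align
    message = None
except ValueError as err:
    message = str(err)

# Comparing against a scalar broadcasts over every row of `a`.
mask = a > b.min()
```

This is why the version above compares `df['timestamp']` with `ndf['start'].min()` and `ndf['end'].max()` instead of the whole columns.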

Edit: I am not sure if there is a prettier solution for a discontinuous time period than to iterate over both dataframes:

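def get_dates():
    count = 0
    for index, values_row in df.iterrows():
        # Parse the timestamp string so it compares with ndf's datetimes.
        timestamp = pd.to_datetime(values_row['timestamp'])
        for _, time_deltas_row in ndf.iterrows():
            if time_deltas_row['start'] < timestamp < time_deltas_row['end']:
                count += 1
                # Stop at the first matching week so each row is counted once.
                break
    return count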
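If what you need is one total per week rather than a count, a possible sketch is to loop over the weeks only and let pandas do the row selection. The small frames below are made-up stand-ins for ndf and df, the helper name `weekly_sums` and the `total` column are hypothetical, and the explicit `format` string is an assumption based on the sample timestamps:

```python
import pandas as pd

# Made-up stand-ins for the question's ndf and df.
ndf = pd.DataFrame({"start": pd.to_datetime(["2016-02-22", "2016-03-21"])})
ndf["end"] = ndf["start"] + pd.Timedelta(days=7)
df = pd.DataFrame({
    "timestamp": ["2-24-16 09:00:00", "3-21-16 02:00:00",
                  "3-24-16 05:00:00", "3-11-16 00:00:00"],
    "value": [5, 3, 4, 7],
})

def weekly_sums(df, ndf):
    # Parse the strings once so every comparison is datetime vs datetime.
    ts = pd.to_datetime(df["timestamp"], format="%m-%d-%y %H:%M:%S")
    # One boolean mask per week: start < timestamp < end.
    totals = [
        df.loc[(ts > start) & (ts < end), "value"].sum()
        for start, end in zip(ndf["start"], ndf["end"])
    ]
    return ndf.assign(total=totals)

result = weekly_sums(df, ndf)
```

Each comparison here is between a Series and a single Timestamp, so the "identically-labeled" error does not come up, and weeks without any rows get a total of 0.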

Use resample when you want to group data by evenly-spaced time intervals.

df.assign(timestamp=pd.to_datetime(df['timestamp'])).set_index('timestamp').resample('W-MON', label='left').sum().reset_index()

Returns (the values are random; recent pandas versions show 0 instead of NaN for weeks with no rows):

   timestamp  value
0 2016-02-22   22.0
1 2016-02-29    NaN
2 2016-03-07   13.0
3 2016-03-14   20.0
4 2016-03-21    9.0
5 2016-03-28    NaN
6 2016-04-04    NaN
7 2016-04-11   34.0
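Since resample produces every week between the first and last timestamp, you may still want to keep only the weeks listed in ndf. A rough sketch, assuming the week starts are Mondays as in the question, is to reindex the resampled result on those start dates. The data, the `starts` index and the variable names below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-02-24 09:00", "2016-03-11 00:00",
        "2016-03-22 02:00", "2016-03-24 05:00",
    ]),
    "value": [5, 7, 3, 4],
})
# Stand-in for ndf['start']: only these weeks should appear in the result.
starts = pd.to_datetime(["2016-02-22", "2016-03-07", "2016-03-21", "2016-03-28"])

# Weekly totals labelled by the Monday that opens each week.
weekly = df.set_index("timestamp")["value"].resample("W-MON", label="left").sum()

# Keep the requested weeks; weeks outside the resampled range become 0.
selected = weekly.reindex(starts, fill_value=0)
```

Note that weekly bins are closed on the right by default, so a timestamp that falls exactly on a bin boundary may be assigned differently than with the strict start < timestamp < end test. If that matters for your data, check the `closed` argument of resample.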


 