I want to sum the values in one column of a dataframe for certain dates that are defined by another dataframe.
My first dataframe of dates looks like this:
import numpy as np
import pandas as pd
start_date = ["2-22-16 00:00:00", "2-29-16 00:00:00", "3-7-16 00:00:00", "3-14-16 00:00:00", "3-21-16 00:00:00", "3-28-16 00:00:00", "4-4-16 00:00:00", "4-11-16 00:00:00", "4-18-16 00:00:00", "4-25-16 00:00:00", "5-2-16 00:00:00", "5-9-16 00:00:00", "5-16-16 00:00:00", "5-23-16 00:00:00", "5-30-16 00:00:00", "6-6-16 00:00:00", "6-13-16 00:00:00", "6-20-16 00:00:00", "6-27-16 00:00:00", "7-4-16 00:00:00", "7-11-16 00:00:00", "7-18-16 00:00:00", "7-25-16 00:00:00", "8-08-16 00:00:00", "8-22-16 00:00:00", "8-29-16 00:00:00", "9-5-16 00:00:00", "9-12-16 00:00:00", "9-19-16 00:00:00", "9-26-16 00:00:00", "10-3-16 00:00:00", "10-10-16 00:00:00", "10-17-16 00:00:00", "10-24-16 00:00:00", "10-31-16 00:00:00", "11-7-16 00:00:00", "11-14-16 00:00:00", "11-21-16 00:00:00", "1-23-17 00:00:00", "1-30-17 00:00:00", "2-06-17 00:00:00", "3-13-17 00:00:00", "3-27-17 00:00:00", "6-19-17 00:00:00", "6-26-17 00:00:00"]
start_date = [pd.to_datetime(d) for d in start_date]
end_date = pd.DatetimeIndex(start_date) + pd.DateOffset(7)
ndf = pd.DataFrame({'start':pd.to_datetime(start_date),'end':end_date}); ndf.head()
What I want is the values from another dataframe that fall within the weeks defined in ndf. My other dataframe looks something like this:
dates = ["4-17-16 04:00:00", "4-16-16 19:30:00", "4-16-16 19:00:00", "2-24-16 09:00:00", "3-21-16 02:00:00", "3-18-16 10:00:00", "3-24-16 05:00:00", "3-11-16 00:00:00"]
df = pd.DataFrame(
    {'timestamp': pd.to_datetime(dates),  # parse the strings so they can be compared with ndf's datetimes
     'value': np.random.randint(1, 25, size=(8,))})
Now I want to create a new dataframe that sums all the values from df that fall between the dates in ndf. So I created this function:
def get_dates(x):
    # Select the df values between start and ending datetime.
    n = df[(df['timestamp']>ndf['start'])&(df['timestamp']<ndf['end'])]
    # Return sum of values
    return n.values[0],n['value'].sum()
I also played around with this: n = df[(df['timestamp']>ndf['start'])&(df['timestamp']<ndf['end'])]. But I get the error: ValueError: Can only compare identically-labeled Series objects.
I'm looking for someone to help me clean up my function so that it works or provide insight on the error message above. Thanks!
For your specific case, where the start and end dates form one continuous time period, you would probably want to use something like this:
def get_dates():
    # Select the df values between the earliest start and the latest end datetime.
    n = df[(df['timestamp'] > ndf['start'].min()) &
           (df['timestamp'] < ndf['end'].max())]
    # Return the first matching row and the sum of values
    return n.values[0], n['value'].sum()
Your error says that you are trying to compare Series of different lengths, which therefore do not share the same index labels: your ndf has 45 rows while df has 1000.
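To make the error concrete, here is a minimal sketch that reproduces it with two tiny Series. The names a, b, message and mask are made up for illustration and are not part of the question's data; the point is only that comparing two Series whose indexes differ fails, while comparing against a scalar works.

```python
import pandas as pd

# Two Series of different length, so their index labels do not match
a = pd.Series([1, 2, 3])
b = pd.Series([1, 2])

message = ""
try:
    a > b  # element-wise comparison needs identical labels on both sides
except ValueError as exc:
    message = str(exc)
    print(message)

# Comparing against a single scalar (here the minimum of b) is fine
mask = a > b.min()
print(mask.tolist())
```

This is why reducing ndf['start'] and ndf['end'] to scalars with .min() and .max() makes the comparison valid.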
Edit: I am not sure if there is a prettier solution for a discontinuous time period than to iterate over both dataframes:
def get_dates():
    total = 0
    for _, values_row in df.iterrows():
        for _, time_deltas_row in ndf.iterrows():
            if time_deltas_row['start'] < values_row['timestamp'] < time_deltas_row['end']:
                # Add the value once, then stop checking other intervals for this row
                total += values_row['value']
                break
    return total
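If the nested loops become too slow, one possible alternative is to let NumPy broadcasting compare every timestamp with every interval at once. The sketch below is only an illustration under assumptions: the small ndf and df are made-up stand-ins for the real data, the names inside, per_interval and total are hypothetical, and the intermediate boolean array has one cell per (row, interval) pair, so it is only practical when that product fits in memory.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the question's dataframes
ndf = pd.DataFrame({"start": pd.to_datetime(["2016-02-22", "2016-03-21", "2016-04-11"])})
ndf["end"] = ndf["start"] + pd.Timedelta(days=7)

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-02-24 09:00", "2016-03-24 05:00",
                                 "2016-03-21 02:00", "2016-04-17 04:00",
                                 "2016-03-11 00:00"]),
    "value": [5, 7, 2, 10, 100],
})

ts = df["timestamp"].to_numpy()[:, None]     # shape (n_rows, 1)
starts = ndf["start"].to_numpy()[None, :]    # shape (1, n_intervals)
ends = ndf["end"].to_numpy()[None, :]

# inside[i, j] is True when row i falls strictly inside interval j
inside = (ts > starts) & (ts < ends)

# Sum of values per interval, and the grand total over all intervals
values = df["value"].to_numpy()
per_interval = (inside * values[:, None]).sum(axis=0)
ndf["value_sum"] = per_interval
total = df.loc[inside.any(axis=1), "value"].sum()

print(ndf)
print(total)
```

The strict inequalities mirror the comparison used in the loop version, so a timestamp that is exactly equal to a start or end date is not counted.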
Use resample when you want to group data by evenly spaced time intervals. The timestamp column has to be a datetime type (for example via pd.to_datetime) for this to work.
df.set_index('timestamp').resample('W-MON', label='left').sum().reset_index()
Returns:
timestamp value
0 2016-02-22 22.0
1 2016-02-29 NaN
2 2016-03-07 13.0
3 2016-03-14 20.0
4 2016-03-21 9.0
5 2016-03-28 NaN
6 2016-04-04 NaN
7 2016-04-11 34.0
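Because ndf skips some weeks, you may want weekly sums only for the weeks it actually lists. The following is a rough sketch of one way to do that, using a groupby on the Monday that starts each week instead of resample, and then looking the sums up by ndf's start dates. The data and the names week_start, weekly and value_sum are made up for illustration, and it assumes every start date in ndf is a Monday at midnight. Note that, unlike the strict comparison in the question, a timestamp exactly at Monday 00:00 is counted in the week it starts.

```python
import pandas as pd

# Made-up stand-ins for the question's dataframes
ndf = pd.DataFrame({"start": pd.to_datetime(["2016-02-22", "2016-03-21", "2016-04-04"])})

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-02-24 09:00", "2016-03-21 02:00",
                                 "2016-03-24 05:00", "2016-03-11 00:00"]),
    "value": [5, 2, 7, 100],
})

# A "W-SUN" period runs Monday through Sunday, so its start_time is the Monday
week_start = df["timestamp"].dt.to_period("W-SUN").dt.start_time

# Sum the values per week, indexed by that Monday
weekly = df.groupby(week_start)["value"].sum()

# Look up each listed week; weeks without any data get 0
ndf["value_sum"] = ndf["start"].map(weekly).fillna(0).astype(int)
print(ndf)
```

Rows of df that fall in weeks not listed in ndf (here the 2016-03-11 row) are simply ignored by the lookup.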