(Clean code that runs) Getting the mean of values in one dataframe based on date interval and string condition from another dataframe

Question

I have code that runs, but right now it is a mess and I would like to clean it up. Similar questions have been asked on the forum; I took inspiration from several posts and came up with the solution below.

I have one dataframe, precip_data, containing a lot of data. For each row of my original dataframe, df, I want the mean of value over the rows of precip_data whose referenceTime falls within that row's time interval (date1 to date2) and whose sourceID matches its Place. The mean should then be written back into the original dataframe. To keep things simple, the small tables below illustrate the problem.

Dataframe containing data:
precip_data =

sourceID	value	referenceTime
France	3	2020-01-01
France	6	2020-02-01
France	5	2021-01-01
USA	10	2020-01-01
USA	6	2021-01-01

Original dataframe: df =

date1	date2	Place
2020-02-01	2021-01-01	france
2020-01-01	2021-01-01	usa

The output should be: df =

date1	date2	Place	Precipitation
2020-02-01	2021-01-01	france	5.5
2020-01-01	2021-01-01	usa	8
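
For anyone who wants to reproduce the problem, the two input tables above can be built roughly as follows. This is a minimal sketch: it assumes the date columns should be parsed as datetimes, which the question does not state explicitly.

```python
import pandas as pd

# Precipitation records, one row per source and reference time
precip_data = pd.DataFrame({
    "sourceID": ["France", "France", "France", "USA", "USA"],
    "value": [3, 6, 5, 10, 6],
    "referenceTime": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2021-01-01", "2020-01-01", "2021-01-01"]
    ),
})

# Intervals of interest, one row per place (assumption: dates parsed as datetimes)
df = pd.DataFrame({
    "date1": pd.to_datetime(["2020-02-01", "2020-01-01"]),
    "date2": pd.to_datetime(["2021-01-01", "2021-01-01"]),
    "Place": ["france", "usa"],
})
```

Note that Place is lowercase while sourceID is capitalized, so any match between the two has to ignore case.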

I have a solution to the problem, but I would like some help simplifying it:

import numpy as np

# fetching data (case-insensitive, since sourceID holds "France" / "USA")
P_france = precip_data[precip_data['sourceID'].str.contains("france", case=False)]
P_usa = precip_data[precip_data['sourceID'].str.contains("usa", case=False)]

# fetching corresponding rows
L_france = df.loc[df['Place'] == 'france']
L_usa = df.loc[df['Place'] == 'usa']

# calculating the mean value within each row's date interval
df['Precip_france'] = L_france.apply(lambda s: P_france.query('@s.date1 <= referenceTime <= @s.date2').value.mean(), axis=1)
df['Precip_usa'] = L_usa.apply(lambda s: P_usa.query('@s.date1 <= referenceTime <= @s.date2').value.mean(), axis=1)

# rows for the other place are left as NaN, so I can't sum yet
df['Precip_usa'] = df['Precip_usa'].replace(np.nan, 0)
df['Precip_france'] = df['Precip_france'].replace(np.nan, 0)

# summing the 0 values with the value that I want
df['Precipitation'] = df['Precip_usa'] + df['Precip_france']

# keeping only the columns of interest
df = df.drop(['Precip_usa', 'Precip_france'], axis=1)

Not necessary, but it would be nice to also sort the dataframe, first by place and then by referenceTime, so I can be 100% sure I am extracting the correct values. I have inspected the file in Excel and it looks OK for now, but it could be a good addition for future applications.
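
The sorting described above could be sketched with `sort_values`, passing both columns so that rows are ordered by source first and by reference time within each source. The sample rows are deliberately shuffled here for illustration, and the name `precip_sorted` is only a placeholder. Sorting mainly helps when inspecting the data by eye; a boolean date filter selects the same rows regardless of their order.

```python
import pandas as pd

# Sample data in a deliberately shuffled order (illustration only)
precip_data = pd.DataFrame({
    "sourceID": ["USA", "France", "France", "USA", "France"],
    "value": [6, 6, 5, 10, 3],
    "referenceTime": pd.to_datetime(
        ["2021-01-01", "2020-02-01", "2021-01-01", "2020-01-01", "2020-01-01"]
    ),
})

# Sort by source first, then chronologically within each source
precip_sorted = (
    precip_data
    .sort_values(["sourceID", "referenceTime"])
    .reset_index(drop=True)
)
```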

Answer 1

Solution

# Create common merge key
df['key'] = df['Place'].str.lower()
precip_data['key'] = precip_data['sourceID'].str.lower()

# Left Merge the dataframes on common key
m = df.merge(precip_data, on='key', how='left')

# Test the inclusion of referenceTime in [date1, date2]
m['value'] = m['value'].mask(~m['referenceTime'].between(m['date1'], m['date2']))

# Groupby and aggregate the masked column value using mean
out = m.groupby(['date1', 'date2', 'Place'], as_index=False, sort=False)['value'].mean()

>>> out

       date1      date2   Place  value
0 2020-02-01 2021-01-01  france    5.5
1 2020-01-01 2021-01-01     usa    8.0
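
For completeness, the steps above can be run end to end on the question's sample tables. This is a sketch under a few assumptions: the date columns are parsed as datetimes, the temporary key is added with `assign` so the input frames are not modified, and the result column is renamed to Precipitation to match the output the question asks for.

```python
import pandas as pd

precip_data = pd.DataFrame({
    "sourceID": ["France", "France", "France", "USA", "USA"],
    "value": [3, 6, 5, 10, 6],
    "referenceTime": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2021-01-01", "2020-01-01", "2021-01-01"]
    ),
})
df = pd.DataFrame({
    "date1": pd.to_datetime(["2020-02-01", "2020-01-01"]),
    "date2": pd.to_datetime(["2021-01-01", "2021-01-01"]),
    "Place": ["france", "usa"],
})

# Left merge on a lowercase key so "France" matches "france"
m = df.assign(key=df["Place"].str.lower()).merge(
    precip_data.assign(key=precip_data["sourceID"].str.lower()),
    on="key",
    how="left",
)

# Keep a value only when referenceTime lies in [date1, date2] (inclusive)
in_range = m["referenceTime"].between(m["date1"], m["date2"])
m["value"] = m["value"].where(in_range)

# mean() skips the NaN values produced by the mask
out = (
    m.groupby(["date1", "date2", "Place"], as_index=False, sort=False)["value"]
    .mean()
    .rename(columns={"value": "Precipitation"})
)
```

One caveat of grouping on date1, date2 and Place: if df contains two rows with identical values in all three columns, they collapse into a single row in the result.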
