So, I have code that runs, but as of now it is a mess and I would like to clean it up. Similar questions have been asked on the forum, and I have taken inspiration from several posts to come up with the solution below.
I have one dataframe containing a lot of data. I want to take the mean of the values in this dataframe that fall within a time interval and match a sourceID in my original dataframe, and then write that mean value into the original dataframe. To simplify the problem, the small tables below illustrate it.
Dataframe containing data:
precip_data =
sourceID | value | referenceTime |
---|---|---|
France | 3 | 2020-01-01 |
France | 6 | 2020-02-01 |
France | 5 | 2021-01-01 |
USA | 10 | 2020-01-01 |
USA | 6 | 2021-01-01 |
Original dataframe: df =
date1 | date2 | Place |
---|---|---|
2020-02-01 | 2021-01-01 | france |
2020-01-01 | 2021-01-01 | usa |
The output should be: df =
date1 | date2 | Place | Precipitation |
---|---|---|---|
2020-02-01 | 2021-01-01 | france | 5.5 |
2020-01-01 | 2021-01-01 | usa | 8 |
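For anyone who wants to reproduce the examples, the two sample frames can be built like this (a sketch: column names are taken from the tables above, and the date columns are parsed as datetimes, which the interval comparison assumes):

```python
import pandas as pd

# Data frame with the observations (from the precip_data table above)
precip_data = pd.DataFrame({
    "sourceID": ["France", "France", "France", "USA", "USA"],
    "value": [3, 6, 5, 10, 6],
    "referenceTime": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2021-01-01", "2020-01-01", "2021-01-01"]
    ),
})

# Original data frame with one interval per place (from the df table above)
df = pd.DataFrame({
    "date1": pd.to_datetime(["2020-02-01", "2020-01-01"]),
    "date2": pd.to_datetime(["2021-01-01", "2021-01-01"]),
    "Place": ["france", "usa"],
})
```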
I have a solution to the problem; however, I would like some help simplifying it:
import numpy as np
# fetching data per country
P_france = precip_data[precip_data['sourceID'].str.contains("France", case=False)]
P_usa = precip_data[precip_data['sourceID'].str.contains("USA", case=False)]
# fetching the corresponding rows
L_france = df.loc[df['Place'] == 'france']
L_usa = df.loc[df['Place'] == 'usa']
# calculating the mean value per interval
df['Precip_france'] = L_france.apply(lambda s: P_france.query('@s.date1 <= referenceTime <= @s.date2').value.mean(), axis=1)
df['Precip_usa'] = L_usa.apply(lambda s: P_usa.query('@s.date1 <= referenceTime <= @s.date2').value.mean(), axis=1)
# non-matching rows produce NaN, so I can't sum them
df['Precip_usa'] = df['Precip_usa'].replace(np.nan, 0)
df['Precip_france'] = df['Precip_france'].replace(np.nan, 0)
# summing the zero-filled columns to get the value I want
df['Precipitation'] = df['Precip_usa'] + df['Precip_france']
# keeping only the columns of interest
df = df.drop(['Precip_usa', 'Precip_france'], axis=1)
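As a side note, the per-country copy/paste above can be folded into a single helper applied row by row. This is only a sketch of the same logic; the case-insensitive match between `Place` and `sourceID` is an assumption based on the sample tables:

```python
import pandas as pd

# Sample frames from the tables above
precip_data = pd.DataFrame({
    "sourceID": ["France", "France", "France", "USA", "USA"],
    "value": [3, 6, 5, 10, 6],
    "referenceTime": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2021-01-01", "2020-01-01", "2021-01-01"]
    ),
})
df = pd.DataFrame({
    "date1": pd.to_datetime(["2020-02-01", "2020-01-01"]),
    "date2": pd.to_datetime(["2021-01-01", "2021-01-01"]),
    "Place": ["france", "usa"],
})

def interval_mean(row, data):
    # observations for this place (case-insensitive match, an assumption)
    sub = data[data["sourceID"].str.lower() == row["Place"].lower()]
    # keep only observations inside [date1, date2] and average them
    inside = sub["referenceTime"].between(row["date1"], row["date2"])
    return sub.loc[inside, "value"].mean()

df["Precipitation"] = df.apply(interval_mean, axis=1, data=precip_data)
```

This removes the need for the intermediate `Precip_usa`/`Precip_france` columns and the NaN-to-zero workaround, because each row only ever queries its own place.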
Not necessary, but it would be nice to add some sorting to the dataframe: first sort by place, then by referenceTime, to be 100% sure I am extracting the correct values. I have inspected the file in Excel and it looks OK as of now, but for future applications it could be a good addition.
# Create common merge key
df['key'] = df['Place'].str.lower()
precip_data['key'] = precip_data['sourceID'].str.lower()
# Left Merge the dataframes on common key
m = df.merge(precip_data, on='key', how='left')
# Test the inclusion of referenceTime in [date1, date2]
m['value'] = m['value'].mask(~m['referenceTime'].between(m['date1'], m['date2']))
# Groupby and aggregate the masked column value using mean
out = m.groupby(['date1', 'date2', 'Place'], as_index=False, sort=False)['value'].mean()
>>> out
date1 date2 Place value
0 2020-02-01 2021-01-01 france 5.5
1 2020-01-01 2021-01-01 usa 8.0
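Regarding the optional sorting mentioned in the question: the merged frame can be sorted by place and then by referenceTime before masking and aggregating, which makes the intermediate rows easy to inspect and does not change the result. A sketch, with the sample frames rebuilt inline so it runs on its own:

```python
import pandas as pd

precip_data = pd.DataFrame({
    "sourceID": ["France", "France", "France", "USA", "USA"],
    "value": [3, 6, 5, 10, 6],
    "referenceTime": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2021-01-01", "2020-01-01", "2021-01-01"]
    ),
})
df = pd.DataFrame({
    "date1": pd.to_datetime(["2020-02-01", "2020-01-01"]),
    "date2": pd.to_datetime(["2021-01-01", "2021-01-01"]),
    "Place": ["france", "usa"],
})

# Create common merge key and left-merge, as in the answer above
df["key"] = df["Place"].str.lower()
precip_data["key"] = precip_data["sourceID"].str.lower()
m = df.merge(precip_data, on="key", how="left")

# Sort by place, then by observation time, before masking/aggregating
m = m.sort_values(["key", "referenceTime"])

# Mask values whose referenceTime falls outside [date1, date2], then average
m["value"] = m["value"].mask(~m["referenceTime"].between(m["date1"], m["date2"]))
out = m.groupby(["date1", "date2", "Place"], as_index=False, sort=False)["value"].mean()
```

Since `groupby(...).mean()` ignores NaN and aggregates per group regardless of row order, the sort is purely for readability of the intermediate frame `m`.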