(Clean code that runs) Getting the mean of values in one dataframe based on date interval and string condition from another dataframe

I have code that runs, but right now it is a mess and I would like to clean it up. Similar questions have been asked on the forum; I took inspiration from several posts and came up with the solution below.

I have one dataframe containing a lot of data. For each row of my original dataframe, I want to take the mean of the values in the data dataframe whose referenceTime falls within that row's time interval and whose sourceID matches that row's Place, then write the mean into the original dataframe. To simplify the problem, I have given some small tables below to illustrate it.

Dataframe containing data:
precip_data =

sourceID value referenceTime
France 3 2020-01-01
France 6 2020-02-01
France 5 2021-01-01
USA 10 2020-01-01
USA 6 2021-01-01

Original dataframe: df =

date1 date2 Place
2020-02-01 2021-01-01 france
2020-01-01 2021-01-01 usa

The output should be: df =

date1 date2 Place Precipitation
2020-02-01 2021-01-01 france 5.5
2020-01-01 2021-01-01 usa 8
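
For anyone who wants to reproduce this, the small tables above could be built roughly as follows. This is only a sketch: converting the date columns with pd.to_datetime is an assumption, since the question does not state which dtype the real columns have.

```python
import pandas as pd

# Sample data copied from the tables above
precip_data = pd.DataFrame({
    "sourceID": ["France", "France", "France", "USA", "USA"],
    "value": [3, 6, 5, 10, 6],
    # Assumption: dates are stored as datetimes rather than strings
    "referenceTime": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2021-01-01", "2020-01-01", "2021-01-01"]
    ),
})

df = pd.DataFrame({
    "date1": pd.to_datetime(["2020-02-01", "2020-01-01"]),
    "date2": pd.to_datetime(["2021-01-01", "2021-01-01"]),
    "Place": ["france", "usa"],
})
```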

I have a solution to the problem; however, I would like some help making it simpler:

import numpy as np

# fetching data (case-insensitive match, since sourceID is capitalized)
P_france = precip_data[precip_data['sourceID'].str.contains("france", case=False)]
P_usa = precip_data[precip_data['sourceID'].str.contains("usa", case=False)]

# fetching corresponding rows
L_france = df.loc[df['Place'] == 'france']
L_usa = df.loc[df['Place'] == 'usa']

# calculating the mean value within each row's date interval
df['Precip_france'] = L_france.apply(lambda s: P_france.query('@s.date1 <= referenceTime <= @s.date2').value.mean(), axis=1)
df['Precip_usa'] = L_usa.apply(lambda s: P_usa.query('@s.date1 <= referenceTime <= @s.date2').value.mean(), axis=1)

# rows belonging to the other place are NaN, so fill them with 0 before summing
df['Precip_usa'] = df['Precip_usa'].replace(np.nan, 0)
df['Precip_france'] = df['Precip_france'].replace(np.nan, 0)

# summing: each row holds one real value and one 0
df['Precipitation'] = df['Precip_usa'] + df['Precip_france']

# keeping only the columns of interest
df = df.drop(['Precip_usa', 'Precip_france'], axis=1)

Not necessary, but it would be nice to sort the data dataframe first by place and then by referenceTime, to be 100% sure I am extracting the correct values. I have inspected the file in Excel and it looks OK as of now, but for future applications it could be a good addition.
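
The optional sorting could be sketched with sort_values, passing the place column first and the time column second. The shuffled sample rows and the name precip_sorted below are made up for illustration. Note that sorting strings is case-sensitive, so this assumes sourceID is capitalized consistently.

```python
import pandas as pd

# Sample rows from the question, deliberately shuffled for illustration
precip_data = pd.DataFrame({
    "sourceID": ["USA", "France", "USA", "France", "France"],
    "value": [6, 6, 10, 5, 3],
    "referenceTime": pd.to_datetime(
        ["2021-01-01", "2020-02-01", "2020-01-01", "2021-01-01", "2020-01-01"]
    ),
})

# Sort by place first, then chronologically within each place
precip_sorted = (
    precip_data
    .sort_values(["sourceID", "referenceTime"])
    .reset_index(drop=True)
)
print(precip_sorted)
```

The mean itself does not depend on row order, so this only helps with inspecting the data by eye.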

Solution

# Create common merge key
df['key'] = df['Place'].str.lower()
precip_data['key'] = precip_data['sourceID'].str.lower()

# Left Merge the dataframes on common key
m = df.merge(precip_data, on='key', how='left')

# Test the inclusion of referenceTime in [date1, date2]
m['value'] = m['value'].mask(~m['referenceTime'].between(m['date1'], m['date2']))

# Groupby and aggregate the masked column value using mean
out = m.groupby(['date1', 'date2', 'Place'], as_index=False, sort=False)['value'].mean()

>>> out

       date1      date2   Place  value
0 2020-02-01 2021-01-01  france    5.5
1 2020-01-01 2021-01-01     usa    8.0
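
For reuse, the same merge, mask, and groupby steps could be wrapped in a small function. This is a sketch under a few assumptions: the function name mean_precip_in_interval is hypothetical, the date columns are datetimes, and the result column is renamed to Precipitation to match the desired output. It works on copies via assign, so no helper key column is left behind on the inputs.

```python
import pandas as pd

def mean_precip_in_interval(df, precip_data):
    """Hypothetical helper: mean of `value` per row of df, restricted to
    matching place and referenceTime within [date1, date2]."""
    # Case-insensitive merge key, added to copies only
    left = df.assign(key=df["Place"].str.lower())
    right = precip_data.assign(key=precip_data["sourceID"].str.lower())

    # Left merge keeps places that have no precipitation rows at all
    m = left.merge(right, on="key", how="left")

    # Blank out values whose referenceTime is outside the interval
    in_range = m["referenceTime"].between(m["date1"], m["date2"])
    m["value"] = m["value"].where(in_range)

    # mean() skips NaN, so only in-range values contribute
    out = m.groupby(["date1", "date2", "Place"], as_index=False, sort=False)["value"].mean()
    return out.rename(columns={"value": "Precipitation"})

# Usage with the sample data from the question
precip_data = pd.DataFrame({
    "sourceID": ["France", "France", "France", "USA", "USA"],
    "value": [3, 6, 5, 10, 6],
    "referenceTime": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2021-01-01", "2020-01-01", "2021-01-01"]
    ),
})
df = pd.DataFrame({
    "date1": pd.to_datetime(["2020-02-01", "2020-01-01"]),
    "date2": pd.to_datetime(["2021-01-01", "2021-01-01"]),
    "Place": ["france", "usa"],
})

out = mean_precip_in_interval(df, precip_data)
print(out)
```

A place with no values inside its interval ends up with NaN rather than 0. Because the result is grouped on date1, date2, and Place, duplicate rows in df would be collapsed into one.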
