I have a dataframe like:
id  date        temperature
1   2011-09-12  12
    2011-09-15  12
    2011-10-13  12
2   2011-12-12  14
    2011-12-24  15
I want to make sure that each device id has a temperature recording for each day: if a value exists, it should be copied from the data above; if it doesn't, I will put 0.
So, I prepared another dataframe which has dates for the entire year, using:
pd.DataFrame(0, index=pd.date_range('2011-01-01', '2011-12-12'), columns=['temperature'])
date temperature
2011-01-01 0
.
.
.
2011-12-12 0
Now, for each id, I want to merge this dataframe so that each id has entries for the entire year.
I am stuck at the merge step; just merging on the date column does not work, i.e.
pd.merge(df1, df2, on=['date'])
gives a blank dataframe.
Create a MultiIndex with MultiIndex.from_product and merge on both MultiIndexes:
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  pd.date_range('2011-01-01', '2011-12-12')],
                                 names=['id','date'])
df1 = pd.DataFrame(0, index=mux, columns=['temperature'])
df = pd.merge(df1, df, left_index=True, right_index=True, how='left')
If you want only one temperature column:
df = pd.merge(df1, df, left_index=True, right_index=True, how='left', suffixes=('','_'))
df['temperature'] = df.pop('temperature_').fillna(df['temperature'])
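For reference, the merge approach above can be tried end to end on the sample data from the question. This is a minimal sketch that assumes id and date form a two-level index (as the use of df.index.levels[0] suggests); the name merged is only illustrative.

```python
import pandas as pd

# Sample data from the question; id and date are assumed to be index levels.
df = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2],
        "date": pd.to_datetime(
            ["2011-09-12", "2011-09-15", "2011-10-13", "2011-12-12", "2011-12-24"]
        ),
        "temperature": [12, 12, 12, 14, 15],
    }
).set_index(["id", "date"])

# All (id, date) combinations for the date range used in the question.
mux = pd.MultiIndex.from_product(
    [df.index.levels[0], pd.date_range("2011-01-01", "2011-12-12")],
    names=["id", "date"],
)
df1 = pd.DataFrame(0, index=mux, columns=["temperature"])

# Left merge keeps every row of df1; measured values arrive as "temperature_".
merged = pd.merge(
    df1, df, left_index=True, right_index=True, how="left", suffixes=("", "_")
)

# Prefer the measured value and fall back to the 0 placeholder.
merged["temperature"] = merged.pop("temperature_").fillna(merged["temperature"])
```

Because unmatched rows pass through NaN during the merge, the resulting temperature column is float rather than integer.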
Another idea is to use itertools.product to build a two-column DataFrame:
from itertools import product
data = list(product(df.index.levels[0], pd.date_range('2011-01-01', '2011-12-12')))
df1 = pd.DataFrame(data, columns=['id','date'])
df = pd.merge(df1, df, left_on=['id','date'], right_index=True, how='left')
Another idea is to use DataFrame.reindex:
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  pd.date_range('2011-01-01', '2011-12-12')],
                                 names=['id','date'])
df = df.reindex(mux, fill_value=0)
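To make the reindex variant easier to check, here is a self-contained sketch on the sample data from the question. It assumes id and date are the two index levels; the name full is illustrative only.

```python
import pandas as pd

# Sample data from the question; id and date are assumed to be index levels.
df = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2],
        "date": pd.to_datetime(
            ["2011-09-12", "2011-09-15", "2011-10-13", "2011-12-12", "2011-12-24"]
        ),
        "temperature": [12, 12, 12, 14, 15],
    }
).set_index(["id", "date"])

# Every (id, date) combination for the date range used in the question.
mux = pd.MultiIndex.from_product(
    [df.index.levels[0], pd.date_range("2011-01-01", "2011-12-12")],
    names=["id", "date"],
)

# Existing rows keep their value; combinations missing from df get 0.
full = df.reindex(mux, fill_value=0)

print(full.loc[1].loc["2011-09-10":"2011-09-16"])
```

Rows whose date lies outside the range, such as 2011-12-24 for id 2, are dropped by reindex, so the end date may need to be extended to 2011-12-31.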
As an alternative to jezrael's answer, you could also use the following iteration, especially if you want to keep your device id intact:
import pandas as pd

data = {"date": [pd.Timestamp('2011-09-12'), pd.Timestamp('2011-09-15'), pd.Timestamp('2011-10-13'),
                 pd.Timestamp('2011-12-12'), pd.Timestamp('2011-12-24')],
        "temperature": [12, 12, 12, 14, 15],
        "sensor_id": [1, 1, 1, 2, 2]}
df1 = pd.DataFrame(data, index=data["sensor_id"])
df2 = pd.DataFrame(0, index=pd.date_range('2011-01-01', '2011-12-12'), columns=['temperature', 'sensor_id'])
# copy each measurement into the matching day of the full-range frame
for i, row in df1.iterrows():
    df2.loc[df2.index == row["date"], ['temperature']] = row['temperature']
    df2.loc[df2.index == row["date"], ['sensor_id']] = row['sensor_id']
# show the rows for the measured dates
for t in data["date"]:
    print(df2[df2.index == t])
Note that df2 in your question only goes to 2011-12-12, hence the last print() will return an empty DataFrame. I wasn't sure whether you did this on purpose.
Also, depending on the variability and density in your actual data, it might make sense to use:
for s in [1, 2]:  # iterate over device ids
    ma = (df['sensor_id'] == s)
    df.loc[ma] = df.loc[ma].ffill()  # fill forward
so that an incomplete time series is filled forward with the last measured temperature value. This depends on the quality of your data, of course, and df.resample() might make more sense.
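As a rough sketch of the forward-fill idea without an explicit loop, the frame can be reindexed to one row per sensor and day and then filled within each sensor group. The full-year range, the 0 used for days before the first reading, and the names days, full_index and filled are assumptions for illustration, not part of the question.

```python
import pandas as pd

# Sample data from the question, with sensor_id and date as ordinary columns.
df = pd.DataFrame(
    {
        "sensor_id": [1, 1, 1, 2, 2],
        "date": pd.to_datetime(
            ["2011-09-12", "2011-09-15", "2011-10-13", "2011-12-12", "2011-12-24"]
        ),
        "temperature": [12, 12, 12, 14, 15],
    }
)

# Assumed range: the whole of 2011, one row per sensor and day.
days = pd.date_range("2011-01-01", "2011-12-31")
full_index = pd.MultiIndex.from_product(
    [df["sensor_id"].unique(), days], names=["sensor_id", "date"]
)

# Missing days become NaN after the reindex.
filled = df.set_index(["sensor_id", "date"]).reindex(full_index)

# Forward fill within each sensor only, then use 0 before the first reading.
filled["temperature"] = (
    filled.groupby(level="sensor_id")["temperature"].ffill().fillna(0)
)
```

Grouping by sensor_id before filling keeps the last value of one sensor from leaking into the first days of the next one.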