I have an Excel file loaded into a dataframe old_df that I keep up to date by adding new rows from another Excel file's dataframe new_df. I simply pd.concat the new and the old frames together if one of the dates in the new dataframe doesn't exist in the old. Currently some of the important columns in this file are:
Pub Date    Forecast Time  Forecast Date  State  Temp
2018-12-12  23:00:00       2018-12-20     AK     3
2018-12-12  02:00:00       2018-12-20     AK     3.2
2018-12-12  05:00:00       2018-12-20     AK     2.9
...
I want to make sure I skip duplicate rows when I update this old file with new data - that is, skip rows whose combination of Pub Date, Forecast Time, Forecast Date and State is not unique. Right now I'm using a pretty poor method for this: I take a list of Pub Dates for the new and the old frames:
import sys
import pandas as pd

dateList_old = date_old.tolist()
dateList_new = date_new.tolist()
result = any(elm in dateList_new for elm in dateList_old)
if result:
    print('One or more of the dates already exists in the database')
    sys.exit()
else:
    frames = [old_df, new_df]
    result = pd.concat(frames)
    result.to_excel("file", encoding="utf-8", index=False)
But this runs into issues: if any Pub Date in the new data already exists, the entire write is abandoned. I'd like to make it so that if a row's Pub Date + Forecast Time + Forecast Date + State combination is already in old_df, that row is skipped and all other rows that don't exist are still written, exiting only if all of these combinations already exist. Is there an easy way to do this?
You can also use:
df.append(df1,ignore_index=True).drop_duplicates(subset=['Pub Date','Forecast Time','Forecast Date','State'])
Considering the two dataframes as:
df:
Pub Date Forecast Time Forecast Date State Temp
0 2018-12-12 23:00:00 2018-12-20 AK 3.0
1 2018-12-12 02:00:00 2018-12-20 AK 3.2
2 2018-12-12 05:00:00 2018-12-20 AK 2.9
df1:
Pub Date Forecast Time Forecast Date State Temp
0 2018-12-12 23:00:00 2018-12-20 AK 3.0
1 2018-12-13 02:00:00 2018-12-20 AK 3.2
2 2018-12-13 05:00:00 2018-12-20 AK 2.9
df.append(df1,ignore_index=True).drop_duplicates(subset=['Pub Date','Forecast Time','Forecast Date','State'])
Pub Date Forecast Time Forecast Date State Temp
0 2018-12-12 23:00:00 2018-12-20 AK 3.0
1 2018-12-12 02:00:00 2018-12-20 AK 3.2
2 2018-12-12 05:00:00 2018-12-20 AK 2.9
4 2018-12-13 02:00:00 2018-12-20 AK 3.2
5 2018-12-13 05:00:00 2018-12-20 AK 2.9
Basically, this appends the two dataframes and drops duplicates based only on the columns ['Pub Date','Forecast Time','Forecast Date','State'].
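Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the same result can be produced with pd.concat. A minimal sketch of the equivalent call, using the two example dataframes above:

```python
import pandas as pd

# The two example dataframes from the answer above.
df = pd.DataFrame({
    'Pub Date': ['2018-12-12'] * 3,
    'Forecast Time': ['23:00:00', '02:00:00', '05:00:00'],
    'Forecast Date': ['2018-12-20'] * 3,
    'State': ['AK'] * 3,
    'Temp': [3.0, 3.2, 2.9],
})
df1 = pd.DataFrame({
    'Pub Date': ['2018-12-12', '2018-12-13', '2018-12-13'],
    'Forecast Time': ['23:00:00', '02:00:00', '05:00:00'],
    'Forecast Date': ['2018-12-20'] * 3,
    'State': ['AK'] * 3,
    'Temp': [3.0, 3.2, 2.9],
})

keys = ['Pub Date', 'Forecast Time', 'Forecast Date', 'State']

# pd.concat replaces df.append; drop_duplicates keeps the first
# occurrence of each key combination, so duplicates from df1 are dropped.
result = pd.concat([df, df1], ignore_index=True).drop_duplicates(subset=keys)
print(result)
```

This yields the same five rows as the append-based version: the first row of df1 duplicates the first row of df and is dropped.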
To summarize your question: you have two dataframes ("old" and "new") and you want to concatenate rows from "new" that do not already exist in "old" (based on your pub dates, forecast times, etc.). Correct?
You can do logical indexing. For example, identify rows where ALL conditions are met in both dataframes:
import pandas as pd

# Element-wise comparison: this assumes old and new have the same
# length and aligned indices.
idx = ((old['Pub Date'] == new['Pub Date'])
       & (old['Forecast Time'] == new['Forecast Time'])
       & (old['Forecast Date'] == new['Forecast Date'])
       & (old['State'] == new['State']))

if not idx.all():
    # Concatenate onto the old dataframe only the new rows that
    # are not duplicates; skip entirely if every row already exists.
    old = pd.concat([old, new.loc[~idx, :]], axis=0)
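The row-by-row comparison above only works when old and new are the same length with aligned indices. A sketch (my own variation, not from the answers above) that handles frames of different lengths checks each key tuple of new against old via Index.isin:

```python
import pandas as pd

keys = ['Pub Date', 'Forecast Time', 'Forecast Date', 'State']

# Small illustrative frames of unequal length.
old = pd.DataFrame({
    'Pub Date': ['2018-12-12'],
    'Forecast Time': ['23:00:00'],
    'Forecast Date': ['2018-12-20'],
    'State': ['AK'],
    'Temp': [3.0],
})
new = pd.DataFrame({
    'Pub Date': ['2018-12-12', '2018-12-13'],
    'Forecast Time': ['23:00:00', '02:00:00'],
    'Forecast Date': ['2018-12-20', '2018-12-20'],
    'State': ['AK', 'AK'],
    'Temp': [3.0, 3.2],
})

# Mark each row of new as a duplicate if its key tuple already
# appears in old, regardless of row order or frame length.
is_dup = new.set_index(keys).index.isin(old.set_index(keys).index)

if is_dup.all():
    print('All combinations already exist in the database')
else:
    # Append only the genuinely new rows.
    old = pd.concat([old, new.loc[~is_dup]], ignore_index=True)
```

This matches the asker's stated goal: duplicate rows are skipped individually, and the write is abandoned only when every incoming combination already exists.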