
Checker for updating already-existing data in a dataframe (Python)

I have an Excel file loaded into a dataframe old_df that I keep up to date by adding new rows from another Excel file's dataframe new_df. I simply pd.concat the new and old frames together if one of the dates in the new dataframe doesn't exist in the old.

Currently some of the important columns in this file are:

Pub Date      Forecast Time   Forecast Date   State   Temp
2018-12-12    23:00:00        2018-12-20      AK      3
2018-12-12    02:00:00        2018-12-20      AK      3.2
2018-12-12    05:00:00        2018-12-20      AK      2.9
.
.

I want to make sure I skip duplicate rows when I update this old file with new data: rows whose combination of Pub Date, Forecast Time, Forecast Date and State is not unique.

Right now I'm using a pretty poor method for this, comparing lists of Pub Dates from the new and the old:

dateList_old = date_old.tolist()
dateList_new = date_new.tolist()

result = any(elm in dateList_new for elm in dateList_old)

if result:
    print('One or more of the dates already exists in the database')
    sys.exit()

else:

    frames = [old_df,new_df]

    result = pd.concat(frames)
    result.to_excel("file", encoding="utf-8", index=False)

But this runs into issues: if I were to add any row whose Pub Date already exists, it would abort the entire write.

I'd like to make it so that if the combination Pub Date + Forecast Time + Forecast Date + State is already in old_df, that row is skipped, all other rows that don't exist are still written, and the script exits only if all of these combinations already exist.

Is there an easy way to do this?

You can also use:

pd.concat([df, df1], ignore_index=True).drop_duplicates(subset=['Pub Date','Forecast Time','Forecast Date','State'])

(In older pandas this was written as df.append(df1, ignore_index=True), but DataFrame.append was removed in pandas 2.0.)

Considering the two dataframes as:

df :

    Pub Date Forecast Time Forecast Date State  Temp
0 2018-12-12      23:00:00    2018-12-20    AK   3.0
1 2018-12-12      02:00:00    2018-12-20    AK   3.2
2 2018-12-12      05:00:00    2018-12-20    AK   2.9

df1 :

    Pub Date Forecast Time Forecast Date State  Temp
0 2018-12-12      23:00:00    2018-12-20    AK   3.0
1 2018-12-13      02:00:00    2018-12-20    AK   3.2
2 2018-12-13      05:00:00    2018-12-20    AK   2.9

pd.concat([df, df1], ignore_index=True).drop_duplicates(subset=['Pub Date','Forecast Time','Forecast Date','State'])

    Pub Date Forecast Time Forecast Date State  Temp
0 2018-12-12      23:00:00    2018-12-20    AK   3.0
1 2018-12-12      02:00:00    2018-12-20    AK   3.2
2 2018-12-12      05:00:00    2018-12-20    AK   2.9
4 2018-12-13      02:00:00    2018-12-20    AK   3.2
5 2018-12-13      05:00:00    2018-12-20    AK   2.9

Basically this concatenates both dataframes and drops duplicates based only on the columns ['Pub Date','Forecast Time','Forecast Date','State'], keeping the first (i.e. old) copy of each duplicated combination.
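A runnable sketch of the same idea, extended to the asker's "exit only if every combination already exists" requirement (tiny made-up frames based on the sample data; sys.exit as in the original question):

```python
import sys
import pandas as pd

KEY = ['Pub Date', 'Forecast Time', 'Forecast Date', 'State']

old_df = pd.DataFrame({
    'Pub Date': ['2018-12-12', '2018-12-12'],
    'Forecast Time': ['23:00:00', '02:00:00'],
    'Forecast Date': ['2018-12-20', '2018-12-20'],
    'State': ['AK', 'AK'],
    'Temp': [3.0, 3.2],
})

new_df = pd.DataFrame({
    'Pub Date': ['2018-12-12', '2018-12-13'],   # first row duplicates old_df
    'Forecast Time': ['23:00:00', '05:00:00'],
    'Forecast Date': ['2018-12-20', '2018-12-20'],
    'State': ['AK', 'AK'],
    'Temp': [3.0, 2.9],
})

# Concatenate, then keep only the first (old) copy of each key combination.
combined = pd.concat([old_df, new_df], ignore_index=True).drop_duplicates(subset=KEY)

if len(combined) == len(old_df):
    # nothing from new_df survived, so every combination already existed
    print('All of the new rows already exist in the database')
    sys.exit()

print(len(combined))  # 3 rows: two old, one genuinely new
```

The duplicate row from new_df is dropped, the genuinely new one survives, and the script only bails out when new_df contributes nothing at all.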

To summarize your question: you have two dataframes ("old" and "new") and you want to concatenate rows from "new" that do not already exist in "old" (based on your pub dates, forecast times, etc.). Correct?

You can use logical indexing: identify the rows where ALL conditions are met in both dataframes.

idx = ((old['Pub Date'] == new['Pub Date'])
       & (old['Forecast Time'] == new['Forecast Time'])
       & (old['Forecast Date'] == new['Forecast Date'])
       & (old['State'] == new['State']))

if not idx.all():
    # concatenate only the genuinely new rows onto the old dataframe
    old = pd.concat([old, new.loc[~idx, :]], axis=0)
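Note that this elementwise comparison only works when old and new have the same length and row order. When they don't, a merge-based anti-join is safer; here is a sketch (column names from the question, data made up):

```python
import pandas as pd

key = ['Pub Date', 'Forecast Time', 'Forecast Date', 'State']

old = pd.DataFrame({
    'Pub Date': ['2018-12-12'],
    'Forecast Time': ['23:00:00'],
    'Forecast Date': ['2018-12-20'],
    'State': ['AK'],
    'Temp': [3.0],
})

new = pd.DataFrame({
    'Pub Date': ['2018-12-12', '2018-12-13'],
    'Forecast Time': ['23:00:00', '05:00:00'],
    'Forecast Date': ['2018-12-20', '2018-12-20'],
    'State': ['AK', 'AK'],
    'Temp': [3.0, 2.9],
})

# Left-merge new against old's key columns; '_merge' is 'both' for rows
# whose key combination already exists in old, 'left_only' otherwise.
marked = new.merge(old[key].drop_duplicates(), on=key, how='left', indicator=True)
fresh = new[(marked['_merge'] == 'left_only').to_numpy()]

old = pd.concat([old, fresh], ignore_index=True)
print(len(old))  # 2: the original row plus the one genuinely new row
```

This doesn't require the two frames to be aligned, and dropping duplicate keys from old before the merge prevents rows of new from being multiplied.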
