
Updating Dataframe based on another new dataframe

I have 2 dataframes structured in the same way as follows:

df1 = pd.read_csv("Main_Database.csv")
# df1 Columns: ..., Timestamp, Name, Query, Website, Status,...

df2 = pd.read_csv("New_Raw_Results.csv")
# df2 Columns: ..., Timestamp, Name, Query, Website, Status,...

Both dataframes have exactly the same columns.

Main_Database.csv keeps track of all records; New_Raw_Results.csv is a list of new results that comes in every week. I would like to process changes in my main database based on 3 scenarios:

A) IF a Query AND Website pair in DF2 is found in DF1 --> write the Timestamp from DF2 into the DF1 column "Last Seen" --> overwrite Status with "STILL ACTIVE"

B) IF a Query AND Website pair in DF2 is not found in DF1 --> append the entire DF2 row to DF1 --> overwrite Status with "NET NEW"

C) IF a Query AND Website pair in DF1 is not found in DF2 --> overwrite Status with "EXPIRED"

I've tried using a combination of merges and joins, but I'm stuck here. For example, if I isolate in a new dataframe the result of an inner join between these 2 tables, I'm not sure how to use it to take action on my main database. I'm trying to fit all these conditions under one function, so I can use this function to process new entries.

How would you structure this function? What would be the most concise way to approach this problem?

Dataset

import pandas as pd
from numpy.random import default_rng
rng = default_rng()

columns = ['query','website','timestamp','status','last_seen']
data = rng.integers(1,20,(100,5))
df1 = pd.DataFrame(data=data, columns=columns,dtype=str)
data = rng.integers(1,20,(100,5))
df2 = pd.DataFrame(data=data, columns=columns,dtype=str)

Concatenating the query and website columns into a single key makes the comparisons easier, e.g.

      Query   Website
  0  query1  website1  --> 'query1website1'

Make a Series for each DataFrame of the concatenated columns

a = df2['query'].str.cat(df2.website)
b = df1['query'].str.cat(df1.website)
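One thing to watch with plain concatenation: two different pairs can produce the same key, which is plausible with the digit-only sample data above. A minimal sketch, assuming a separator character ("|" here, my choice) that never appears in either column:

```python
import pandas as pd

# Two different (query, website) pairs that collide when simply concatenated.
df = pd.DataFrame({"query": ["1", "11"], "website": ["12", "2"]})

plain = df["query"].str.cat(df["website"])           # "112" and "112"
safe = df["query"].str.cat(df["website"], sep="|")   # "1|12" and "11|2"

print(plain.nunique(), safe.nunique())
```

If the separator can occur in the real data, comparing on both columns directly (for example with a merge) avoids the problem altogether.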

Make a boolean Series for each of your three conditions.

cond1 = a.isin(b)    # df2 rows whose key is already in df1 (A)
cond2 = ~cond1       # df2 rows whose key is new (B)
cond3 = ~b.isin(a)   # df1 rows whose key is missing from df2 (C)

Set status based on condition 3 - your C)

df1.loc[cond3,'status'] = 'EXPIRED'

Update with new information - your A)

Compare all df2 values (a) with all df1 values (b) using numpy broadcasting and get the indices where they match.

indices1 = (a.values[:,None] == b.values).argmax(1)

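(a.values[:,None] == b.values) results in a 2-D boolean array that compares every a value with every b value. argmax(1) returns, for each a value, the position of the first matching b value. It also returns 0 when there is no match at all, so a result of 0 on its own is ambiguous.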
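To make the behaviour of the broadcast comparison concrete, here is a small sketch with made-up keys (the arrays are stand-ins, not the real data). It shows why a match on row 0 and "no match" cannot be told apart from argmax alone:

```python
import numpy as np

a = np.array(["x", "q", "z"])   # stand-in for the df2 keys
b = np.array(["x", "y", "z"])   # stand-in for the df1 keys

eq = a[:, None] == b            # shape (3, 3): every a against every b
first = eq.argmax(1)            # position of the first True in each row
has_match = eq.any(1)           # whether the row contains any True

print(first)       # "x" matches b[0], "q" matches nothing, "z" matches b[2]
print(has_match)
```

Both "x" and "q" give 0 from argmax; only has_match separates them.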

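# df1 row indices where df1.qw == df2.qw
# (filter with cond1, not indices1 > 0: argmax also gives 0 for "no match",
#  so testing > 0 would drop genuine matches on df1 row 0)
x = indices1[cond1.values]
# df2 rows where df2.qw == df1.qw
y = df2.loc[cond1]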

x is an array of df1 integer indices that have matches in df2. y is a DataFrame of the matches that correspond with x (a subset of df2). Use the integer array to assign new values to the correct df1 rows. This relies on df1 having the default integer index, so that positions and labels coincide.

df1.loc[x,'last_seen'] = y.timestamp.values
df1.loc[x,'status'] = "STILL ACTIVE"

Caveat: if df1 has multiple rows with the same value for qw (the concatenated query and website key), np.argmax will only find the first one, and the columns of the other rows remain unchanged. With random data this crops up periodically.
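If duplicate keys in df1 matter, one possible way around the caveat is to look each df1 key up in a key-to-timestamp mapping built from df2, rather than locating a single position. A sketch under assumptions: the tiny frames and the "|" separator are mine, and when df2 repeats a key the last timestamp wins.

```python
import pandas as pd

df1 = pd.DataFrame({
    "query": ["q1", "q1", "q2"],
    "website": ["w1", "w1", "w2"],
    "status": ["", "", ""],
    "last_seen": ["", "", ""],
})
df2 = pd.DataFrame({"query": ["q1"], "website": ["w1"], "timestamp": ["t9"]})

key1 = df1["query"].str.cat(df1["website"], sep="|")
key2 = df2["query"].str.cat(df2["website"], sep="|")

# Mapping key -> timestamp; keep the last occurrence so the index is unique.
latest = pd.Series(df2["timestamp"].values, index=key2.values)
latest = latest[~latest.index.duplicated(keep="last")]

# Every df1 row whose key appears in df2 is updated, duplicates included.
seen = key1.isin(latest.index)
df1.loc[seen, "last_seen"] = key1[seen].map(latest)
df1.loc[seen, "status"] = "STILL ACTIVE"

print(df1)
```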


Add new rows - your B)

df2.loc[cond2,'status'] = "NET NEW"
df1 = pd.concat([df1,df2.loc[cond2]], ignore_index=True)

Putting it all together:

a = df2['query'].str.cat(df2.website)
b = df1['query'].str.cat(df1.website)

cond1 = a.isin(b)     # df2 rows whose key is already in df1 (A)
cond2 = ~cond1        # df2 rows whose key is new (B)
cond3 = ~b.isin(a)    # df1 rows whose key is missing from df2 (C)

df1.loc[cond3,'status'] = 'EXPIRED'

indices1 = (a.values[:,None] == b.values).argmax(1)
x = indices1[cond1.values]    # not indices1 > 0, which would drop matches on df1 row 0
y = df2.loc[cond1]

df1.loc[x,'last_seen'] = y.timestamp.values
df1.loc[x,'status'] = "STILL ACTIVE"

df2.loc[cond2,'status'] = "NET NEW"
df1 = pd.concat([df1,df2.loc[cond2]], ignore_index=True)

This should do what you need:

import pandas as pd

data = [
{"timestamp": 1, "last_seen": 1, "status": "XXX", "website": "website1", "query": "query1"},
{"timestamp": 1, "last_seen": 2, "status": "XXX", "website": "website2", "query": "query2"},
{"timestamp": 1, "last_seen": 3, "status": "XXX", "website": "website3", "query": "query1"},
{"timestamp": 1, "last_seen": 4, "status": "XXX", "website": "website5", "query": "query1"},
{"timestamp": 1, "last_seen": 5, "status": "XXX", "website": "website6", "query": "query1"}
]

new_data = [
{"timestamp": 1, "last_seen": 6, "status": "XXX", "website": "website1", "query": "query1"},
{"timestamp": 1, "last_seen": 7, "status": "XXX", "website": "website2", "query": "query2"},
{"timestamp": 1, "last_seen": 8, "status": "XXX", "website": "website3", "query": "query4"},
{"timestamp": 1, "last_seen": 9, "status": "XXX", "website": "website3", "query": "query8"}
]

df = pd.DataFrame(data)
df_new = pd.DataFrame(new_data)

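for i, row in df.iterrows():
    tmp = df_new.loc[(df_new['website'] == row['website']) & (df_new['query'] == row['query'])]
    if not tmp.empty:
        # A) key found in the new results: take the value from the first matching row
        #    (use tmp['timestamp'] instead if last_seen should hold the new timestamp)
        df.at[i, 'last_seen'] = tmp['last_seen'].iloc[0]
        df.at[i, 'status'] = "STILL ACTIVE"
    else:
        # C) key no longer present in the new results
        df.at[i, 'status'] = "EXPIRED"

for i, row in df_new.iterrows():
    tmp = df.loc[(df['website'] == row['website']) & (df['query'] == row['query'])]
    if tmp.empty:
        # B) key not in the main frame yet: append the whole row
        row["status"] = "NET NEW"
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        df = pd.concat([df, row.to_frame().T], ignore_index=True)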

print(df)
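For larger frames the two loops can be slow, since each row triggers a full scan of the other frame. A possible loop-free variant, wrapped in a single function as the question asks, uses merge with indicator=True. This is a sketch, not a drop-in replacement: the function name update_main and the sample rows are hypothetical, and it assumes both frames carry the columns query, website, timestamp, status and last_seen.

```python
import pandas as pd

def update_main(main: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply scenarios A, B and C, return the updated frame."""
    keys = ["query", "website"]
    # One row per key in the new results; keep the last occurrence.
    new = new.drop_duplicates(subset=keys, keep="last")

    # Left merge flags each main row as 'both' (A) or 'left_only' (C).
    merged = main.merge(
        new[keys + ["timestamp"]].rename(columns={"timestamp": "new_timestamp"}),
        on=keys, how="left", indicator=True,
    )
    found = merged["_merge"] == "both"
    merged.loc[found, "last_seen"] = merged.loc[found, "new_timestamp"]
    merged.loc[found, "status"] = "STILL ACTIVE"
    merged.loc[~found, "status"] = "EXPIRED"
    merged = merged.drop(columns=["new_timestamp", "_merge"])

    # Anti-join: new rows whose key is absent from main (B).
    probe = new.merge(main[keys].drop_duplicates(), on=keys, how="left", indicator=True)
    net_new = probe.loc[probe["_merge"] == "left_only"].drop(columns="_merge")
    net_new = net_new.assign(status="NET NEW")

    return pd.concat([merged, net_new], ignore_index=True)

# Usage with made-up rows
main = pd.DataFrame({
    "query": ["q1", "q2"], "website": ["w1", "w2"],
    "timestamp": ["t1", "t1"], "status": ["NET NEW", "NET NEW"],
    "last_seen": ["t1", "t1"],
})
new = pd.DataFrame({
    "query": ["q1", "q3"], "website": ["w1", "w3"],
    "timestamp": ["t2", "t2"], "status": ["", ""], "last_seen": ["", ""],
})
result = update_main(main, new)
print(result)
```

Because the merge compares the two key columns directly, no concatenated key is needed, and duplicate keys in the main frame are all updated.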


 