Updating Dataframe based on another new dataframe

Question

I have 2 dataframes structured in the same way as follows:

df1 = pd.read_csv("Main_Database.csv")
# df1 Columns: ..., Timestamp, Name, Query, Website, Status,...

df2 = pd.read_csv("New_Raw_Results.csv")
# df2 Columns: ..., Timestamp, Name, Query, Website, Status,...

Both dataframes have exactly the same columns.

Main_Database.csv keeps track of all records; New_Raw_Results.csv is a list of new results that come in every week. I would like to update my main database according to 3 scenarios:

A) IF Query AND Website in DF2 found in DF1, --> write in DF1 column "Last Seen", using Timestamp from Df2 --> Overwrite Status to "STILL ACTIVE"

B) IF Query AND Website in DF2 not found in DF1, --> append entire df2.row to df1 --> Overwrite Status to "NET NEW"

C) IF Query AND Website in DF1 not found in DF2, --> Overwrite Status to "EXPIRED"

I've tried using a combination of merges and joins, but I'm stuck here. For example, if I isolate in a new dataframe the result of an inner join between these 2 tables, I'm not sure how to use it to take action on my main database. I'm trying to fit all these conditions under one function, so I can use this function to process new entries.

How would you structure this function? What would be the most concise way to approach this problem?

Answer 1

Dataset

import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng()

columns = ['query','website','timestamp','status','last_seen']
data = rng.integers(1,20,(100,5))
df1 = pd.DataFrame(data=data, columns=columns,dtype=str)
data = rng.integers(1,20,(100,5))
df2 = pd.DataFrame(data=data, columns=columns,dtype=str)

Concatenating the query and website columns into a single key makes the comparisons easier. For example:

      Query   Website
  0  query1  website1  --> 'query1website1'

Make a Series for each DataFrame of the concatenated columns

a = df2['query'].str.cat(df2.website)
b = df1['query'].str.cat(df1.website)
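One thing to watch with plain concatenation, shown here as a minimal sketch on made-up data: two different (query, website) pairs can produce the same string. Passing a separator to `str.cat` avoids this, assuming the separator (here `'|'`, chosen only for illustration) never occurs in the data.

```python
import pandas as pd

# Two different (query, website) pairs that collide when concatenated directly
toy = pd.DataFrame({'query': ['ab', 'a'], 'website': ['c', 'bc']})

plain = toy['query'].str.cat(toy['website'])           # 'abc' and 'abc'
safe = toy['query'].str.cat(toy['website'], sep='|')   # 'ab|c' and 'a|bc'

print(plain.nunique())  # 1: the two pairs are indistinguishable
print(safe.nunique())   # 2: the separator keeps them apart
```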

Make a boolean Series for each of your three conditions.

cond1 = a.isin(b)    # df2 rows whose key already exists in df1
cond2 = ~cond1       # df2 rows that are new
cond3 = ~b.isin(a)   # df1 rows that no longer appear in df2

Set status based on condition 3 - your C)

df1.loc[cond3,'status'] = 'EXPIRED'

Update with new information - your A)

Compare all df2 values (a) with all df1 values (b) using NumPy broadcasting and get the indices where they match.

indices1 = (a.values[:,None] == b.values).argmax(1)

(a.values[:,None] == b.values) results in a 2D boolean array that compares every a value with every b value. argmax(1) returns, for each row, the index of the first True, i.e. the first matching df1 row. Note that it also returns 0 for a row that has no match at all.
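A small sketch of that comparison on made-up keys may help. The arrays `a_vals` and `b_vals` are stand-ins for `a.values` and `b.values`; it shows that `argmax` alone cannot tell "matched row 0" from "no match", while `any(1)` can.

```python
import numpy as np

a_vals = np.array(['q1w1', 'q9w9', 'q2w2'])   # stand-in for a.values (df2 keys)
b_vals = np.array(['q1w1', 'q2w2', 'q3w3'])   # stand-in for b.values (df1 keys)

matches = a_vals[:, None] == b_vals   # shape (3, 3): every a against every b
first = matches.argmax(1)             # index of first True per row, 0 if none
found = matches.any(1)                # True where the row has at least one match

print(first.tolist())  # [0, 0, 1]
print(found.tolist())  # [True, False, True]
```

The first two entries of `first` are both 0, but only the first one is a real match, which is what `found` makes explicit.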

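# argmax returns 0 both for "matched df1 row 0" and for "no match",
# so use an explicit mask of the df2 rows that have a match
matched = a.isin(b).values
# df1 row indices where df1.qw == df2.qw
x = indices1[matched]
# df2 rows where df2.qw == df1.qw
y = df2.loc[matched]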

x is an array of df1 integer indices that have matches in df2. y is a DataFrame of the matches that correspond with x (a subset of df2). Use the integer array to assign new values to the correct df1 rows.

df1.loc[x,'last_seen'] = y.timestamp.values
df1.loc[x,'status'] = "STILL ACTIVE"

Caveat: if df1 has multiple rows with the same query/website key (qw), np.argmax only finds the first one, and the columns of the later duplicates remain unchanged. With random data this crops up periodically.
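One possible way around the duplicate caveat, as a sketch rather than part of the original answer, is to build a key-to-timestamp lookup from df2 and apply it with `Series.map`, which updates every df1 row that shares a key. The toy data, the `'|'` separator, and the choice to keep the last timestamp when df2 repeats a key are all assumptions.

```python
import pandas as pd

df1 = pd.DataFrame({
    'query':     ['q1', 'q1', 'q2'],
    'website':   ['w1', 'w1', 'w2'],   # first two rows share the same key
    'timestamp': ['t0', 't0', 't0'],
    'status':    ['', '', ''],
    'last_seen': ['', '', ''],
})
df2 = pd.DataFrame({
    'query':     ['q1'],
    'website':   ['w1'],
    'timestamp': ['t5'],
    'status':    [''],
    'last_seen': [''],
})

a = df2['query'].str.cat(df2['website'], sep='|')
b = df1['query'].str.cat(df1['website'], sep='|')

# key -> timestamp lookup; if df2 repeats a key, keep the last one (assumption)
lookup = pd.Series(df2['timestamp'].values, index=a.values)
lookup = lookup[~lookup.index.duplicated(keep='last')]

seen = b.isin(lookup.index)                        # df1 rows present in df2
df1.loc[seen, 'last_seen'] = b[seen].map(lookup)   # aligned on df1's index
df1.loc[seen, 'status'] = 'STILL ACTIVE'

print(df1[['query', 'website', 'status', 'last_seen']])
```

Both duplicate df1 rows receive the new timestamp here, whereas the argmax approach would update only the first.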

Add new rows - your B)

df2.loc[cond2,'status'] = "NET NEW"
df1 = pd.concat([df1,df2.loc[cond2]], ignore_index=True)

Complete code:

a = df2['query'].str.cat(df2.website)
b = df1['query'].str.cat(df1.website)

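cond1 = a.isin(b)    # df2 rows whose key already exists in df1
cond2 = ~cond1       # df2 rows that are new
cond3 = ~b.isin(a)   # df1 rows that no longer appear in df2

df1.loc[cond3,'status'] = 'EXPIRED'

indices1 = (a.values[:,None] == b.values).argmax(1)
# argmax gives 0 for "no match" too, so select the real matches with cond1
x = indices1[cond1.values]
y = df2.loc[cond1]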

df1.loc[x,'last_seen'] = y.timestamp.values
df1.loc[x,'status'] = "STILL ACTIVE"

df2.loc[cond2,'status'] = "NET NEW"
df1 = pd.concat([df1,df2.loc[cond2]], ignore_index=True)
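Since the question asks for a single function, here is one way the three scenarios could be wrapped up, sketched with `DataFrame.merge` and its `indicator` option instead of string concatenation. The name `update_main`, the lowercase column names, and the rule of keeping only the last df2 row per key are assumptions, not something stated in the question.

```python
import pandas as pd

def update_main(df1, df2, keys=('query', 'website')):
    """Return an updated copy of df1 (hypothetical helper, not from the answer)."""
    keys = list(keys)
    # Assumption: if df2 repeats a key pair, only its last row counts
    new = df2.drop_duplicates(subset=keys, keep='last')

    # Left merge keeps df1's row order and row count because `new` is unique on keys
    merged = df1.merge(new[keys + ['timestamp']], on=keys, how='left',
                       suffixes=('', '_new'), indicator=True)
    found = (merged['_merge'] == 'both').to_numpy()

    out = df1.copy()
    # A) key present in both frames
    out.loc[found, 'last_seen'] = merged.loc[found, 'timestamp_new'].to_numpy()
    out.loc[found, 'status'] = 'STILL ACTIVE'
    # C) key only in df1
    out.loc[~found, 'status'] = 'EXPIRED'

    # B) key only in df2
    flags = new.merge(df1[keys].drop_duplicates(), on=keys, how='left',
                      indicator=True)['_merge']
    added = new.loc[flags.eq('left_only').to_numpy()].assign(status='NET NEW')
    return pd.concat([out, added], ignore_index=True)

# Toy data for illustration only
df1 = pd.DataFrame({
    'query': ['q1', 'q2'], 'website': ['w1', 'w2'],
    'timestamp': ['t0', 't0'], 'status': ['', ''], 'last_seen': ['', ''],
})
df2 = pd.DataFrame({
    'query': ['q1', 'q3'], 'website': ['w1', 'w3'],
    'timestamp': ['t5', 't5'], 'status': ['', ''], 'last_seen': ['', ''],
})

result = update_main(df1, df2)
print(result[['query', 'website', 'status', 'last_seen']])
```

Merging on the two key columns directly avoids the key-collision issue of concatenated strings and handles duplicate keys in df1, at the cost of an extra merge to find the new rows.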

Answer 2

This should do what you need:

import pandas as pd

data = [
{"timestamp": 1, "last_seen": 1, "status": "XXX", "website": "website1", "query": "query1"},
{"timestamp": 1, "last_seen": 2, "status": "XXX", "website": "website2", "query": "query2"},
{"timestamp": 1, "last_seen": 3, "status": "XXX", "website": "website3", "query": "query1"},
{"timestamp": 1, "last_seen": 4, "status": "XXX", "website": "website5", "query": "query1"},
{"timestamp": 1, "last_seen": 5, "status": "XXX", "website": "website6", "query": "query1"}
]

new_data = [
{"timestamp": 1, "last_seen": 6, "status": "XXX", "website": "website1", "query": "query1"},
{"timestamp": 1, "last_seen": 7, "status": "XXX", "website": "website2", "query": "query2"},
{"timestamp": 1, "last_seen": 8, "status": "XXX", "website": "website3", "query": "query4"},
{"timestamp": 1, "last_seen": 9, "status": "XXX", "website": "website3", "query": "query8"}
]

df = pd.DataFrame(data)
df_new = pd.DataFrame(new_data)

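for i, row in df.iterrows():
    tmp = df_new.loc[(df_new['website'] == row['website']) & (df_new['query'] == row['query'])]
    if not tmp.empty:
        # A) found in both: take the value from the first matching new row
        # (use tmp['timestamp'] instead to follow scenario A literally)
        df.at[i, 'last_seen'] = tmp['last_seen'].iloc[0]
        df.at[i, 'status'] = "STILL ACTIVE"
    else:
        # C) in df but not in df_new
        df.at[i, 'status'] = "EXPIRED"

for i, row in df_new.iterrows():
    # B) in df_new but not in df
    tmp = df.loc[(df['website'] == row['website']) & (df['query'] == row['query'])]
    if tmp.empty:
        row["status"] = "NET NEW"
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)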

print(df)

2 answers

solution1
0 2020-09-25 17:05:08

solution2
0 ACCEPTED 2020-09-25 17:11:08
