
Updating Dataframe based on another new dataframe

I have 2 dataframes structured in the same way as follows:

df1 = pd.read_csv("Main_Database.csv")
# df1 Columns: ..., Timestamp, Name, Query, Website, Status,...

df2 = pd.read_csv("New_Raw_Results.csv")
# df2 Columns: ..., Timestamp, Name, Query, Website, Status,...

Both dataframes can have exactly the same columns.

My Main_database.csv keeps track of all records; my new_raw_results is a list of new results that come in every week. I would like to process changes in my main_database based on 3 scenarios:

A) IF Query AND Website in DF2 found in DF1, --> write in DF1 column "Last Seen", using Timestamp from DF2 --> Overwrite Status to "STILL ACTIVE"

B) IF Query AND Website in DF2 not found in DF1, --> append entire df2.row to df1 --> Overwrite Status to "NET NEW"

C) IF Query AND Website in DF1 not found in DF2, --> Overwrite Status to "EXPIRED"

I've tried using a combination of merges and joins, but I'm stuck here. For example, if I isolate in a new dataframe the result of an inner join between these 2 tables, I'm not sure how to use it to take action on my main database. I'm trying to fit all these conditions under one function, so I can use this function to process new entries.

How would you structure this function? What would be the most concise way to approach this problem?

Dataset

import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng()

columns = ['query','website','timestamp','status','last_seen']
data = rng.integers(1,20,(100,5))
df1 = pd.DataFrame(data=data, columns=columns,dtype=str)
data = rng.integers(1,20,(100,5))
df2 = pd.DataFrame(data=data, columns=columns,dtype=str)

Concatenating the query and website columns will facilitate comparisons, e.g.

      Query   Website
  0  query1  website1  --> 'query1website1'

Make a Series of the concatenated columns for each DataFrame.

a = df2['query'].str.cat(df2.website)
b = df1['query'].str.cat(df1.website)
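One thing to watch with plain concatenation: without a separator, two different (query, website) pairs can produce the same key. A minimal sketch of the issue, assuming a separator character (`'|'` here, picked only for illustration) that never occurs in either column; the `demo` frame is made up:

```python
import pandas as pd

# Two different (query, website) pairs that collide when simply concatenated
demo = pd.DataFrame({'query': ['ab', 'a'], 'website': ['c', 'bc']})

plain = demo['query'].str.cat(demo['website'])           # both rows give 'abc'
keyed = demo['query'].str.cat(demo['website'], sep='|')  # 'ab|c' and 'a|bc'

print(plain.nunique(), keyed.nunique())
```

If the real data could contain the separator, comparing the two columns directly (for example with a merge on both columns) avoids the problem altogether.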

Make a boolean Series for each of your three conditions.

cond1 = a.isin(b)    # ended up not using this
cond2 = ~cond1
cond3 = ~b.isin(a)

Set status based on condition 3 - your C)

df1.loc[cond3,'status'] = 'EXPIRED'

Update with new information - your A)

Compare all df2 values ( a ) with all df1 values ( b ) using numpy broadcasting and get the indices where they match.

indices1 = (a.values[:,None] == b.values).argmax(1)

(a.values[:,None] == b.values) results in a 2d boolean array which is a comparison of every a value with every b value. The argmax function returns, for each row, the index of the first match (and 0 for a row with no match at all).
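To make the broadcasting step concrete, here is a small sketch on made-up key arrays (`a_demo` and `b_demo` are illustrative names, not part of the data above). It also shows why argmax alone cannot tell "matched df1 row 0" from "no match":

```python
import numpy as np

a_demo = np.array(['q1w1', 'q9w9', 'q2w2'])  # keys from the new results
b_demo = np.array(['q1w1', 'q2w2', 'q3w3'])  # keys from the main table

matches = a_demo[:, None] == b_demo  # shape (3, 3): every a against every b
first_match = matches.argmax(1)      # position of the first True in each row
has_match = matches.any(1)           # False where a row has no True at all

print(first_match)  # [0 0 1]
print(has_match)    # [ True False  True]
```

The first two entries of `first_match` are both 0, although only the first key actually matches, so a separate mask such as `has_match` is needed to separate the two cases.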

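# argmax gives 0 both for "no match" and for "match at df1 row 0",
# so use a separate mask of the df2 rows that have a match
matched = a.isin(b).values
# df1 row indices where df1.qw == df2.qw
x = indices1[matched]
# df2 rows where df2.qw == df1.qw
y = df2.loc[matched]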

x is an array of df1 integer indices that have matches in df2. y is a DataFrame of the matches that correspond with x (a subset of df2). Use the integer array to assign new values to the correct df1 rows.

df1.loc[x,'last_seen'] = y.timestamp.values
df1.loc[x,'status'] = "STILL ACTIVE"

Caveat: if df1 has multiple rows with the same value for qw (the concatenated query + website key), np.argmax will only find the first one and the columns of the later duplicates remain unchanged. With random data this crops up periodically.
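If duplicate keys in the main table should all be updated, one option is to skip argmax and look the timestamp up by key instead. A rough sketch on made-up frames (`main`, `new`, `lookup` and the `'|'` separator are assumptions for illustration, not part of the answer above):

```python
import pandas as pd

main = pd.DataFrame({
    'query':     ['q1', 'q1', 'q2'],
    'website':   ['w1', 'w1', 'w2'],
    'status':    ['old', 'old', 'old'],
    'last_seen': [None, None, None],
})
new = pd.DataFrame({
    'query':     ['q1'],
    'website':   ['w1'],
    'timestamp': ['2024-01-08'],
})

key_main = main['query'].str.cat(main['website'], sep='|')
key_new = new['query'].str.cat(new['website'], sep='|')

# key -> timestamp lookup; if a key repeats in `new`, keep the last timestamp
lookup = pd.Series(new['timestamp'].values, index=key_new)
lookup = lookup[~lookup.index.duplicated(keep='last')]

seen = key_main.isin(lookup.index)  # True for every duplicate row as well
main.loc[seen, 'last_seen'] = key_main[seen].map(lookup)
main.loc[seen, 'status'] = 'STILL ACTIVE'

print(main)
```

Both `q1`/`w1` rows are updated here, while the `q2`/`w2` row is left alone.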


Add new rows - your B)

df2.loc[cond2,'status'] = "NET NEW"
df1 = pd.concat([df1,df2.loc[cond2]], ignore_index=True)

Complete...

a = df2['query'].str.cat(df2.website)
b = df1['query'].str.cat(df1.website)

cond1 = a.isin(b)    # ended up not using this
cond2 = ~cond1
cond3 = ~b.isin(a)

df1.loc[cond3,'status'] = 'EXPIRED'

indices1 = (a.values[:,None] == b.values).argmax(1)
matched = a.isin(b).values    # df2 rows that have a match in df1
x = indices1[matched]
y = df2.loc[matched]

df1.loc[x,'last_seen'] = y.timestamp.values
df1.loc[x,'status'] = "STILL ACTIVE"

df2.loc[cond2,'status'] = "NET NEW"
df1 = pd.concat([df1,df2.loc[cond2]], ignore_index=True)
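Since the question asks for a single function, the same three steps could also be wrapped up using `DataFrame.merge` with `indicator=True` instead of concatenated keys. This is only a sketch under assumptions: the function name `process_new_results` and the tiny example frames are hypothetical, lower-case column names are assumed as in the dataset above, and when a key pair repeats in the new results the last row is assumed to win.

```python
import pandas as pd

def process_new_results(main, new, keys=('query', 'website')):
    """Apply scenarios A, B and C; returns an updated copy of `main`.

    Hypothetical helper, not part of pandas.
    """
    keys = list(keys)
    main = main.copy()
    # Assumption: if a key pair repeats in `new`, the last row wins
    new = new.drop_duplicates(subset=keys, keep='last')

    # The keys in `new` are unique now, so a left merge yields exactly one
    # row per `main` row, in the same order
    seen = main[keys].merge(new[keys + ['timestamp']], on=keys,
                            how='left', indicator=True)
    in_new = (seen['_merge'] == 'both').to_numpy()

    # C) in main but not in the new results
    main.loc[~in_new, 'status'] = 'EXPIRED'
    # A) in both: copy the new timestamp into last_seen
    main.loc[in_new, 'last_seen'] = seen.loc[in_new, 'timestamp'].to_numpy()
    main.loc[in_new, 'status'] = 'STILL ACTIVE'

    # B) in the new results but not in main: append the whole row
    known = new[keys].merge(main[keys].drop_duplicates(), on=keys,
                            how='left', indicator=True)
    is_fresh = (known['_merge'] == 'left_only').to_numpy()
    fresh = new.loc[is_fresh].assign(status='NET NEW')

    return pd.concat([main, fresh], ignore_index=True)

main = pd.DataFrame({
    'query': ['q1', 'q2'], 'website': ['w1', 'w2'],
    'timestamp': ['t0', 't0'], 'status': ['x', 'x'], 'last_seen': ['t0', 't0'],
})
new = pd.DataFrame({
    'query': ['q1', 'q3'], 'website': ['w1', 'w3'],
    'timestamp': ['t1', 't1'], 'status': ['x', 'x'], 'last_seen': ['t1', 't1'],
})

result = process_new_results(main, new)
print(result[['query', 'website', 'status', 'last_seen']])
```

Unlike the argmax version, this updates every duplicate key row in the main table, and it never builds the full len(df2) x len(df1) comparison array.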

This should do what you need:

import pandas as pd

data = [
{"timestamp": 1, "last_seen": 1, "status": "XXX", "website": "website1", "query": "query1"},
{"timestamp": 1, "last_seen": 2, "status": "XXX", "website": "website2", "query": "query2"},
{"timestamp": 1, "last_seen": 3, "status": "XXX", "website": "website3", "query": "query1"},
{"timestamp": 1, "last_seen": 4, "status": "XXX", "website": "website5", "query": "query1"},
{"timestamp": 1, "last_seen": 5, "status": "XXX", "website": "website6", "query": "query1"}
]

new_data = [
{"timestamp": 1, "last_seen": 6, "status": "XXX", "website": "website1", "query": "query1"},
{"timestamp": 1, "last_seen": 7, "status": "XXX", "website": "website2", "query": "query2"},
{"timestamp": 1, "last_seen": 8, "status": "XXX", "website": "website3", "query": "query4"},
{"timestamp": 1, "last_seen": 9, "status": "XXX", "website": "website3", "query": "query8"}
]

df = pd.DataFrame(data)
df_new = pd.DataFrame(new_data)

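for i, row in df.iterrows():
    tmp = df_new.loc[(df_new['website'] == row['website']) & (df_new['query'] == row['query'])]
    if not tmp.empty:
        # A) found in both: take the scalar from the first matching new row
        # (use tmp['timestamp'] instead to copy the timestamp, as the question asks)
        df.at[i, 'last_seen'] = tmp['last_seen'].iloc[0]
        df.at[i, 'status'] = "STILL ACTIVE"
    else:
        # C) in df but not in df_new
        df.at[i, 'status'] = "EXPIRED"

for i, row in df_new.iterrows():
    # B) in df_new but not in df
    tmp = df.loc[(df['website'] == row['website']) & (df['query'] == row['query'])]
    if tmp.empty:
        row["status"] = "NET NEW"
        # DataFrame.append was removed in pandas 2.0; use pd.concat instead
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)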

print(df)

