Compare columns of two dataframes and create a new dataframe
I have two different dataframes and I want to compare some columns for every row in df A.
Dataframe A:
M_ID From To M_Type T_Type T_Length T_Weight #Trucks Loading_Time
1025 A B Boxes Open 12-Tyre 22 3 27-March-2019 6:00PM
1029 C D Cylinders Trailer High 23 2 28-March-2019 6:00PM
1989 G H Scrap Open 14-Tyre 25 5 26-March-2019 9:00PM
Dataframe B:
T_ID From To T_Type T_Length T_Weight #Trucks Price
6569 A B Open 12-Tyre 22 5 1500
8658 G H Open 14-Tyre 25 4 1800
4595 A B Open 12-Tyre 22 3 1400
1252 A B Trailer Low 28 5 2000
7754 C D Trailer High 23 4 1900
3632 G H Open 14-Tyre 25 10 2000
6521 C D Trailer High 23 8 1700
8971 C D Open 12-Tyre 22 8 1200
4862 G H Trailer High 25 15 2200
I want to compare certain columns of A and B, i.e. "From, To, T_Type, T_Length, T_Weight, #Trucks".
"From, To, T_Type, T_Length, T_Weight" have to be equal in both dataframes, but B[#Trucks] >= A[#Trucks]. When this condition is true, the matches should be sorted by price and a new dataframe should be created with M_ID and T_ID, like this:
Dataframe Results:
Manufacturer Best_match Second_best_match
1025 4595 6569
1029 6521 7754
1989 3632 -
You could try:
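import pandas as pd

# dfa and dfb are the two dataframes from the question.
# '#Trucks' exists in both, so the merge suffixes it with _x (A) and _y (B).
dfc = pd.merge(dfa, dfb, on=['From', 'To', 'T_Type', 'T_Length', 'T_Weight'], how='inner')
dfc.drop(['From', 'To', 'M_Type', 'T_Weight', 'T_Length', 'Loading_Time', 'T_Type'], axis=1, inplace=True)
# Keep only the offers in B that have at least as many trucks as A needs.
dfc = dfc[dfc['#Trucks_y'] >= dfc['#Trucks_x']].drop(['#Trucks_y', '#Trucks_x'], axis=1)
dfc.rename(columns={"M_ID": "Manufacturer", "T_ID": "BestMatches"}, inplace=True)
# Grouping by Manufacturer and Price sorts the matches by price (one T_ID is kept per price).
dfc = dfc.groupby(['Manufacturer', 'Price'])['BestMatches'].agg('first').reset_index().drop(['Price'], axis=1)
# Collect the price-ordered T_IDs into one list per manufacturer.
dfc = dfc.groupby(['Manufacturer'])['BestMatches'].agg(list).reset_index()
# Spread each list over separate columns and mark missing matches with '-'.
dfd = dfc['BestMatches'].apply(pd.Series)
dfc.drop(["BestMatches"], axis=1, inplace=True)
dfc = dfc.join(dfd).fillna('-')
print(dfc)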
Output:
Manufacturer 0 1
0 1025 4595.0 6569.0
1 1029 6521.0 7754.0
2 1989 3632.0 -
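For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the two dataframes from the question and produces the requested column names directly. It is not the code above but a variant under a few assumptions: M_Type and Loading_Time are left out because they are not needed for matching, the names keys, merged, ranked and result are my own, and sort_values replaces the groupby on Price, so ties on price are kept rather than collapsed.

```python
import pandas as pd

# Data copied from the question (M_Type and Loading_Time omitted).
dfa = pd.DataFrame({
    "M_ID": [1025, 1029, 1989],
    "From": ["A", "C", "G"],
    "To": ["B", "D", "H"],
    "T_Type": ["Open", "Trailer", "Open"],
    "T_Length": ["12-Tyre", "High", "14-Tyre"],
    "T_Weight": [22, 23, 25],
    "#Trucks": [3, 2, 5],
})
dfb = pd.DataFrame({
    "T_ID": [6569, 8658, 4595, 1252, 7754, 3632, 6521, 8971, 4862],
    "From": ["A", "G", "A", "A", "C", "G", "C", "C", "G"],
    "To": ["B", "H", "B", "B", "D", "H", "D", "D", "H"],
    "T_Type": ["Open", "Open", "Open", "Trailer", "Trailer",
               "Open", "Trailer", "Open", "Trailer"],
    "T_Length": ["12-Tyre", "14-Tyre", "12-Tyre", "Low", "High",
                 "14-Tyre", "High", "12-Tyre", "High"],
    "T_Weight": [22, 25, 22, 28, 23, 25, 23, 22, 25],
    "#Trucks": [5, 4, 3, 5, 4, 10, 8, 8, 15],
    "Price": [1500, 1800, 1400, 2000, 1900, 2000, 1700, 1200, 2200],
})

# Columns that must be equal in A and B.
keys = ["From", "To", "T_Type", "T_Length", "T_Weight"]
merged = dfa.merge(dfb, on=keys, suffixes=("_A", "_B"))

# B must offer at least as many trucks as A needs; cheapest offers first.
merged = merged[merged["#Trucks_B"] >= merged["#Trucks_A"]]
merged = merged.sort_values(["M_ID", "Price"])

# One price-ordered list of T_IDs per manufacturer.
ranked = merged.groupby("M_ID")["T_ID"].agg(list)

result = pd.DataFrame({
    "Manufacturer": ranked.index,
    "Best_match": [ids[0] for ids in ranked],
    "Second_best_match": [ids[1] if len(ids) > 1 else "-" for ids in ranked],
})
print(result)
```

Note that a manufacturer with no matching offer at all would simply be missing from result in this sketch; a left merge would be needed to keep such rows.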
If you want to check for equal values in a certain column, let's say Name, you can merge both dataframes into a new one:
mergedStuff = pd.merge(df1, df2, on=['Name'], how='inner')
mergedStuff.head()
I think this is more efficient and faster than where if you have a big data set.
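To make the merge concrete, here is a small sketch with made-up data. Only the Name column and the mergedStuff variable come from the answer; the Age and City columns and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the 'Name' column is taken from the answer.
df1 = pd.DataFrame({"Name": ["Ann", "Bob", "Cid"], "Age": [31, 45, 27]})
df2 = pd.DataFrame({"Name": ["Bob", "Cid", "Dan"], "City": ["Oslo", "Rome", "Lima"]})

# An inner merge keeps only the rows whose Name appears in both dataframes.
mergedStuff = pd.merge(df1, df2, on=["Name"], how="inner")
print(mergedStuff)
```

Only Bob and Cid survive the merge, each carrying the columns from both sides.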
And if you want to get the differences, you can do something like this:
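One possible way to get the non-matching rows, sketched here with made-up data, is an outer merge with indicator=True. This is an assumption about what "differences" means here (names present in only one dataframe). The sample values and the names diff and only_in_one are hypothetical.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df1 = pd.DataFrame({"Name": ["Ann", "Bob", "Cid"], "Age": [31, 45, 27]})
df2 = pd.DataFrame({"Name": ["Bob", "Cid", "Dan"], "City": ["Oslo", "Rome", "Lima"]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
diff = pd.merge(df1, df2, on=["Name"], how="outer", indicator=True)

# Rows that are not in both dataframes are the differences on 'Name'.
only_in_one = diff[diff["_merge"] != "both"]
print(only_in_one)
```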
This approach, df1 != df2, works only for dataframes with identical rows and columns. In fact, all dataframe axes are compared with the _indexed_same method, and an exception is raised if any differences are found, even in the order of columns or indices.
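A short sketch of that behaviour, with made-up frames (df3, changed and raised are names I introduce for illustration): the comparison works when the labels are identical and raises a ValueError when only the column order differs.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [1, 5], "b": [3, 4]})  # same labels as df1, one value changed
df3 = pd.DataFrame({"b": [3, 4], "a": [1, 2]})  # same data as df1, columns reordered

# Identically labelled frames: the elementwise comparison works.
changed = df1 != df2
print(changed)

# Reordered columns: the comparison raises instead of aligning the frames.
try:
    df1 != df3
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```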
If I understood you correctly, you want to find not the changes but the symmetric difference. For that, one approach is to concatenate the dataframes:
>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)
group by all columns:
>>> df_gpby = df.groupby(list(df.columns))
get the index of unique records:
>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
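Put together as a script, the steps above might look like the sketch below. The sample data, the final df.loc[idx] selection and the drop_duplicates alternative are my additions, and both variants assume that neither dataframe contains duplicate rows of its own (such rows would be dropped as well).

```python
import pandas as pd

# Hypothetical sample data: rows (2, 'y') and (3, 'z') occur in both frames.
df1 = pd.DataFrame({"id": [1, 2, 3], "val": ["x", "y", "z"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "val": ["y", "z", "w"]})

# Stack both frames and give every row a unique integer label.
df = pd.concat([df1, df2]).reset_index(drop=True)

# Identical rows end up in the same group.
df_gpby = df.groupby(list(df.columns))

# Groups of size 1 are rows that occur in only one of the frames.
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

# Select those rows: this is the symmetric difference.
sym_diff = df.loc[idx]
print(sym_diff)

# Shorter alternative: drop every row that has a duplicate.
alt = df.drop_duplicates(keep=False)
print(alt)
```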