Find corresponding values in two DataFrames
I need to find the corresponding values in two pandas DataFrames.
Input:
df1:
   server   system directions msgTYPE    msgID  count
0       1  sys1_in         in     ADT  MSG0001      1
1       1  sys1_in         in     ADT  MSG0002      1
2       1  sys1_in         in     ADT  MSG0003      1
3       1  sys1_in         in     ADT  MSG0004      1
df2:
   server    system directions msgTYPE    msgID  count
0       1  sys2_out         in     ADT  MSG0001      1
1       1  sys2_out         in     ADT  MSG0001      1
2       1  sys3_out         in     ADT  MSG0003      1
3       1  sys4_out         in     ADT  MSG0004      1
The output should be:
  system_in system_out  count
0   sys1_in   sys2_out      2
1   sys1_in   sys3_out      1
2   sys1_in   sys4_out      1
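So I need to build a single DataFrame from the two, with one column for the input system and one for the output system, matched on msgID.
I am doing this with df.itertuples and df.groupby: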
import pandas as pd

# ins and outs are presumably df1 and df2 from above
model = pd.DataFrame(columns=['in', 'out', 'count'])
for item in ins.itertuples(index=True, name='Pandas'):
    selected = outs.query('msgID == "%s"' % (getattr(item, "msgID")))
    for row in selected.itertuples(index=True, name='Pandas2'):
        # note: DataFrame.append was removed in pandas 2.0, so this needs an older pandas
        model = model.append({'in': getattr(item, "system"), 'out': getattr(row, "system"), 'count': 1},
                             ignore_index=True)
result = model.groupby(['in', 'out'])['count'].sum().reset_index()
It works, but it is extremely inefficient: the input frames (df1, df2) have about 2 million rows. Does anyone know a more efficient way to do this in pandas?
Cheers.
You can achieve this by first merging the dataframes on the corresponding columns and then using GroupBy with named aggregations (new in pandas >= 0.25.0):
# every column except 'system' is used as a join key
columns = [col for col in df1.columns if col != 'system']
mrg = df1.merge(df2, on=columns, suffixes=['_in', '_out'])
mrg.groupby(columns).agg(
    system_in=('system_in', 'first'),
    system_out=('system_out', 'first'),
    count=('system_in', 'size')
).reset_index(drop=True)
Output
  system_in system_out  count
0   sys1_in   sys2_out      2
1   sys1_in   sys3_out      1
2   sys1_in   sys4_out      1
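For anyone who wants to try this out, here is a minimal self-contained sketch that rebuilds the sample frames from the question. It is a variation rather than the exact code above: it joins on msgID only and groups on the (system_in, system_out) pair, which is my reading of what the loop in the question does. The variable names `pairs` and `result` are just illustrative.

```python
import pandas as pd

cols = ["server", "system", "directions", "msgTYPE", "msgID", "count"]
df1 = pd.DataFrame(
    [[1, "sys1_in", "in", "ADT", "MSG0001", 1],
     [1, "sys1_in", "in", "ADT", "MSG0002", 1],
     [1, "sys1_in", "in", "ADT", "MSG0003", 1],
     [1, "sys1_in", "in", "ADT", "MSG0004", 1]],
    columns=cols,
)
df2 = pd.DataFrame(
    [[1, "sys2_out", "in", "ADT", "MSG0001", 1],
     [1, "sys2_out", "in", "ADT", "MSG0001", 1],
     [1, "sys3_out", "in", "ADT", "MSG0003", 1],
     [1, "sys4_out", "in", "ADT", "MSG0004", 1]],
    columns=cols,
)

# Inner join on msgID only (an assumption about the intended key);
# the shared 'system' column becomes system_in / system_out.
pairs = df1[["system", "msgID"]].merge(
    df2[["system", "msgID"]], on="msgID", suffixes=("_in", "_out")
)

# One row per (system_in, system_out) pair with the number of matched messages.
result = (
    pairs.groupby(["system_in", "system_out"])
    .size()
    .reset_index(name="count")
)
print(result)
```

On the sample data this gives the three rows requested in the question. Grouping on the system pair also keeps the counts separate if one msgID ever shows up under more than one output system, which 'first' would hide.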
If you want to keep the other columns as information, just use merge and GroupBy.count:
df1.merge(df2, on=columns, suffixes=['_in', '_out'])\
.groupby(columns, as_index=False).count()
Output
   server directions msgTYPE    msgID  count  system_in  system_out
0       1         in     ADT  MSG0001      1          2           2
1       1         in     ADT  MSG0003      1          1           1
2       1         in     ADT  MSG0004      1          1           1
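Note that with GroupBy.count the system_in and system_out columns hold counts, not system names. If you want both the message columns and the system names, one possible sketch is to add the two system columns to the group keys and use GroupBy.size. The column name `n_matches` and the variable `detail` are made up for this example, and the sample frames are rebuilt from the question.

```python
import pandas as pd

cols = ["server", "system", "directions", "msgTYPE", "msgID", "count"]
df1 = pd.DataFrame(
    [[1, "sys1_in", "in", "ADT", "MSG0001", 1],
     [1, "sys1_in", "in", "ADT", "MSG0002", 1],
     [1, "sys1_in", "in", "ADT", "MSG0003", 1],
     [1, "sys1_in", "in", "ADT", "MSG0004", 1]],
    columns=cols,
)
df2 = pd.DataFrame(
    [[1, "sys2_out", "in", "ADT", "MSG0001", 1],
     [1, "sys2_out", "in", "ADT", "MSG0001", 1],
     [1, "sys3_out", "in", "ADT", "MSG0003", 1],
     [1, "sys4_out", "in", "ADT", "MSG0004", 1]],
    columns=cols,
)

# Same join keys as above: every column except 'system'.
key_cols = [c for c in df1.columns if c != "system"]
merged = df1.merge(df2, on=key_cols, suffixes=("_in", "_out"))

# Group on the keys plus both system names, then count rows per group.
# 'n_matches' is a hypothetical column name chosen to avoid clashing with 'count'.
detail = (
    merged.groupby(key_cols + ["system_in", "system_out"])
    .size()
    .reset_index(name="n_matches")
)
print(detail)
```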
Use DataFrame.merge to join both dataframes on their common columns. Then you can use DataFrame.groupby with GroupBy.count to count the rows per group, and DataFrame.reindex to put the columns in the right order:
(df1.merge(df2, on=['server', 'msgID', 'directions', 'msgTYPE', 'count'], suffixes=['_in', '_out'])
    .groupby(['server', 'msgID', 'directions', 'msgTYPE']).count().reset_index(drop=True)
    .reindex(columns=['system_in', 'system_out', 'count']))
   system_in  system_out  count
0          2           2      2
1          1           1      1
2          1           1      1
Or use DataFrame.reset_index with drop=False (the default) to keep the rest of the columns:
(df1.merge(df2, on=['server', 'msgID', 'directions', 'msgTYPE', 'count'], suffixes=['_in', '_out'])
    .groupby(['server', 'msgID', 'directions', 'msgTYPE']).count().reindex(columns=['system_in', 'system_out', 'count'])
    .reset_index())
   server    msgID directions msgTYPE  system_in  system_out  count
0       1  MSG0001         in     ADT          2           2      2
1       1  MSG0003         in     ADT          1           1      1
2       1  MSG0004         in     ADT          1           1      1
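Since the question mentions inputs of about 2 million rows, a row-by-row merge can produce a large intermediate frame when msgIDs repeat. A possible sketch, not benchmarked, is to collapse each frame to one row per (msgID, system) first and multiply the row counts after the join. Whether this helps depends on how many duplicates the real data has. The names `n_in`, `n_out`, `ins`, `outs` and `pairs` are illustrative, and only the two relevant columns of the sample data are rebuilt here.

```python
import pandas as pd

# Only the columns needed for the pairing, taken from the question's sample data.
df1 = pd.DataFrame({
    "system": ["sys1_in", "sys1_in", "sys1_in", "sys1_in"],
    "msgID": ["MSG0001", "MSG0002", "MSG0003", "MSG0004"],
})
df2 = pd.DataFrame({
    "system": ["sys2_out", "sys2_out", "sys3_out", "sys4_out"],
    "msgID": ["MSG0001", "MSG0001", "MSG0003", "MSG0004"],
})

# Collapse each frame to one row per (msgID, system) with its row count.
ins = df1.groupby(["msgID", "system"]).size().reset_index(name="n_in")
outs = df2.groupby(["msgID", "system"]).size().reset_index(name="n_out")

# Join the (usually smaller) collapsed frames on msgID.
pairs = ins.merge(outs, on="msgID", suffixes=("_in", "_out"))

# Each input row would have matched each output row with the same msgID,
# so the number of matched pairs is the product of the two counts.
pairs["count"] = pairs["n_in"] * pairs["n_out"]

result = pairs.groupby(["system_in", "system_out"], as_index=False)["count"].sum()
print(result)
```

On the sample data this should match the output requested in the question; on real data it is worth comparing against the plain merge on a small slice before relying on it.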