How do I compare and merge three pandas DataFrames?

A little bit of background:

I have three DOORS modules (A, B, and C) that trace to each other like so:

A --> B
A --> C

B --> C
B <-- A

C <-- A
C <-- B

I can easily capture this "tracing" by exporting the IDs of the other modules that the current module traces to. For example, A's exported table might look like so:

# A Table

|   A   |   B   |   C   |
=========================
|  A_1  |  B_1  |  C_1  |
-------------------------
|  A_2  |       |  C_3  |
-------------------------
|  A_3  |  B_4  |       |
|       |  B_5  |       |
-------------------------

While B and C would look like this:

# B Table                       # C Table

|   A   |   B   |   C   |       |   A   |   B   |   C   |
=========================       =========================
|  A_1  |  B_1  |  C_1  |       |  A_1  |  B_1  |  C_1  |
-------------------------       -------------------------
|       |  B_2  |  C_3  |       |  A_2  |       |  C_3  |
-------------------------       |  A_4  |  B_2  |       |
|  A_3  |  B_4  |       |       -------------------------
-------------------------       
|  A_3  |  B_5  |       |       
-------------------------       

Because the tracing between modules might not be complete, I'm looking to find "gaps" in the tables. For example, A might trace to C and B might trace to C, but A and B might not trace to each other.

The problem:

I've been able to capture each table in a pandas DataFrame. I'm looking to do two things:

  1. Identify missing traces:

    For example, Table A's A_2 has a trace to C_3. Table B's B_2 has a trace to C_3. However, A_2 and B_2 are not traced to each other. This is a missing trace.

  2. Merge these results into a single DataFrame instead of three.
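Both goals can be sketched roughly as follows. This is only one possible approach, not a definitive one. It assumes each export is first reshaped into a long form with one row per trace; the column names `src` and `dst` are made up for illustration, and reading Table C's C_3 record as linked to A_2, A_4 and B_2 is an interpretation of the layout above. It also only checks A/B pairs that share a C target.

```python
import pandas as pd

# Hypothetical long-form encoding of the three exports: one row per trace.
# Assumption: Table C's C_3 record is read as linked to A_2, A_4 and B_2.
table_a = pd.DataFrame({"src": ["A_1", "A_1", "A_2", "A_3", "A_3"],
                        "dst": ["B_1", "C_1", "C_3", "B_4", "B_5"]})
table_b = pd.DataFrame({"src": ["A_1", "B_1", "B_2", "A_3", "A_3"],
                        "dst": ["B_1", "C_1", "C_3", "B_4", "B_5"]})
table_c = pd.DataFrame({"src": ["A_1", "B_1", "A_2", "A_4", "B_2"],
                        "dst": ["C_1", "C_1", "C_3", "C_3", "C_3"]})

# Goal 2: a single DataFrame holding every distinct trace.
traces = (pd.concat([table_a, table_b, table_c], ignore_index=True)
            .drop_duplicates()
            .reset_index(drop=True))

# Split the combined traces by the modules they connect.
src_is_a = traces["src"].str.startswith("A_")
src_is_b = traces["src"].str.startswith("B_")
dst_is_b = traces["dst"].str.startswith("B_")
dst_is_c = traces["dst"].str.startswith("C_")
a_to_c = traces[src_is_a & dst_is_c]
b_to_c = traces[src_is_b & dst_is_c]
a_to_b = traces[src_is_a & dst_is_b].rename(columns={"src": "A", "dst": "B"})

# Goal 1: pair every A and B item that share a C target ...
candidates = (a_to_c.merge(b_to_c, on="dst", suffixes=("_a", "_b"))
                    .rename(columns={"src_a": "A", "src_b": "B", "dst": "C"}))

# ... and keep only the pairs that have no direct A -> B trace.
checked = candidates.merge(a_to_b, on=["A", "B"], how="left", indicator=True)
missing = (checked.loc[checked["_merge"] == "left_only", ["A", "B", "C"]]
                  .reset_index(drop=True))

print(traces)
print(missing)
```

With the sample data, `missing` contains the A_2/B_2 pair from the example, plus A_4/B_2 under the same reading of Table C.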

I think the most difficult part of your task is defining what a missing link is. You might want to spend some time assessing the various possible configurations, since it is not as straightforward as it might seem (or, on the contrary, it might turn out to be pretty simple).

For instance, if table A contains A1,B1, table B contains B1,C1, and table C contains A1,C1, how many missing links are there? Or none at all? How would it differ if any table contained A1,B1,C1?

Another example: [A1,B1], [B1,C2], [B2,C2]. How many missing links are there?
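To make the ambiguity concrete, here is a small sketch that counts missing links under one candidate definition: two items from different modules that share a directly linked neighbour but are not linked to each other. That definition, the function names, and the rule that the first letter identifies the module are all assumptions made for illustration; a different definition (for example, closing links transitively) would give different counts.

```python
from itertools import combinations


def module(item):
    # Hypothetical naming rule: the leading letter identifies the module.
    return item[0]


def missing_links(links):
    """Pairs from different modules that share a neighbour but lack a direct link."""
    linked = {frozenset(pair) for pair in links}

    # Map every item to the set of items it is directly linked to.
    neighbours = {}
    for x, y in links:
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)

    missing = set()
    for shared in neighbours.values():
        for x, y in combinations(sorted(shared), 2):
            if module(x) != module(y) and frozenset((x, y)) not in linked:
                missing.add((x, y))
    return sorted(missing)


first = [("A1", "B1"), ("B1", "C1"), ("A1", "C1")]
second = [("A1", "B1"), ("B1", "C2"), ("B2", "C2")]

print(missing_links(first))   # no gaps under this definition
print(missing_links(second))  # A1 and C2 share B1 but are not linked
```

Note that once A1/C2 is counted as missing, a transitive definition would also flag A1/B2, which is exactly why the definition has to be fixed first.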

You can easily construct many other examples that are not so simple to answer.

Once you have rigorously defined what a missing link is, you can (perhaps easily) write an algorithm that finds them in your tables, no matter how the tables are structured: as three tables, or as a single one formed from the originals with a join, an append, or a side-by-side concatenation.
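As a rough illustration of why the table layout matters less than the definition, the sketch below reduces wide tables to a set of links and shows that three separate tables and one stacked table yield the same set. The helper `links_from_wide` and the rule "IDs in the same row are linked" are assumptions for this example, not something the exports guarantee.

```python
from itertools import combinations

import pandas as pd


def links_from_wide(df):
    """Collect every pair of IDs that appear together in a row (assumed rule)."""
    links = set()
    for row in df.itertuples(index=False):
        present = [value for value in row if pd.notna(value)]
        links.update(frozenset(pair) for pair in combinations(present, 2))
    return links


cols = ["A", "B", "C"]
table_a = pd.DataFrame([["A1", "B1", None]], columns=cols)
table_b = pd.DataFrame([[None, "B1", "C1"]], columns=cols)
table_c = pd.DataFrame([["A1", None, "C1"]], columns=cols)

# Links gathered table by table ...
separate = (links_from_wide(table_a)
            | links_from_wide(table_b)
            | links_from_wide(table_c))

# ... and from a single table built by stacking (appending) the originals.
stacked = links_from_wide(pd.concat([table_a, table_b, table_c], ignore_index=True))

print(separate == stacked)
```

Whatever definition of a missing link you settle on can then operate on this set of links rather than on any particular table shape.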
