Merging two different sets of data in Python

Question

This question was edited. Please see the edit at the bottom first.

This question is going to be a bit long, so I apologize in advance. Please consider two different types of data:

Data A:

{
  "files": [
    {
      "name": "abc",
      "valid": [
        "func4",
        "func1",
        "func3"
      ],
      "invalid": [
        "func2",
        "func8"
      ]
    }
  ]
}

Data B:

{
  "files": [
    {
      "methods": {
        "invalid": [
          "func2",
          "func8"
        ],
        "valid": [
          "func4",
          "func1",
          "func3"
        ]
      },
      "classes": [
        {
          "invalid": [
            "class1",
            "class2"
          ],
          "valid": [
            "class8",
            "class5"
          ],
          "name": "class1"
        }
      ],
      "name": "abc"
    }
  ]
}

I'm trying to merge files of the same type (A files with A files, and B files with B files). A previous question helped me figure out how to do it, but I got stuck again. As I said in that question, there is a rule for merging the files. I'll explain again: consider two dictionaries, A1 and A2. I want to merge the invalid list of A1 with that of A2, and the valid list of A1 with that of A2. The merge itself should be easy enough, but the problem is that the invalid and valid data depend on each other. The dependency rule: if an item x is valid in A1 and invalid in A2, then it is valid in the merged report. The only way for an item to be invalid is to be in the invalid lists of both A1 and A2 (or to be invalid in one of them while not existing in the other). To merge the A files, I wrote the following code:

def merge_A_files(self, src_report):
    for current_file in src_report["files"]:
        filename_index = next((index for (index, d) in enumerate(self.A_report["files"]) if d["name"] == current_file["name"]), None)
        if filename_index == None:
            new_block = {}
            new_block['valid'] = current_file['valid']
            new_block['invalid'] = current_file['invalid']
            new_block['name'] = current_file['name']
            self.A_report["files"].append(new_block)
        else:
            block_to_merge = self.A_report["files"][filename_index]
            merged_block = {'valid': [], 'invalid': []}
            merged_block['valid'] = list(set(block_to_merge['valid'] + current_file['valid']))
            merged_block['invalid'] = list({i for l in [block_to_merge['invalid'], current_file['invalid']]
                                              for i in l if i not in merged_block['valid']})
            merged_block['name'] = current_file['name']
            self.A_report["files"][filename_index] = merged_block

For merging B files I wrote:

def _merge_functional_files(self, src_report):
    for current_file in src_report["files"]:
        filename_index = next((index for (index, d) in enumerate(self.B_report["files"]) if d["name"] == current_file["name"]), None)
        if filename_index == None:
            new_block = {'methods': {}, 'classes': []}
            new_block['methods']['valid'] = current_file['methods']['valid']
            new_block['methods']['invalid'] = current_file['methods']['invalid']
            new_block['classes'] += [{'valid': c['valid'],  'invalid': c['invalid'], 'name': c['name'] } for c in current_file['classes']]
            new_block['name'] = current_file['name']
            self.B_report["files"].append(new_block)
        else:
            block_to_merge = self.B_report["files"][filename_index]
            merged_block = {'methods': {}, 'classes': []}
            for current_class in block_to_merge["classes"]:
                current_classname = current_class.get("name")
                class_index = next((index for (index, d) in enumerate(merged_block["classes"]) if d["name"] == current_classname), None)
                if class_index == None:
                    merged_block['classes'] += ([{'valid': c['valid'],  'invalid': c['invalid'], 'name': c['name'] } for c in current_file['classes']])
                else:
                    class_block_to_merge = merged_block["classes"][class_index]
                    class_merged_block = {'valid': [], 'invalid': []}
                    class_merged_block['valid'] = list(set(class_block_to_merge['valid'] + current_class['valid']))
                    class_merged_block['invalid'] = list({i for l in [class_block_to_merge['invalid'], current_class['invalid']]
                                                            for i in l if i not in class_merged_block['valid']})
                    class_merged_block['name'] = current_classname
                    merged_block["classes"][filename_index] = class_merged_block

            merged_block['methods']['valid'] = list(set(block_to_merge['methods']['valid'] + current_file['methods']['valid']))
            merged_block['methods']['invalid'] = list({i for l in [block_to_merge['methods']['invalid'], current_file['methods']['invalid']]
                                                         for i in l if i not in merged_block['methods']['valid']})
            merged_block['name'] = current_file['name']
            self.B_report["files"][filename_index] = merged_block

It looks like the code for A is valid and works as expected. But I have a problem with B, especially with merging classes. Here is the example I have a problem with:

First file:

{
  "files": [
    {
      "name": "some_file1",
      "methods": {
        "valid": [
          "func4",
          "func1"
        ],
        "invalid": [
          "func3"
        ]
      },
      "classes": [
        {
          "name": "class1",
          "valid": [
            "class1",
            "class2"
          ],
          "invalid": [
            "class3",
            "class5"
          ]
        }
      ]
    }
  ]
}

Second file:

{
  "files": [
    {
      "name": "some_file1",
      "methods": {
        "valid": [
          "func4",
          "func1",
          "func3"
        ],
        "invalid": [
          "func2",
          "func8"
        ]
      },
      "classes": [
        {
          "name": "class1",
          "valid": [
            "class8",
            "class5"
          ],
          "invalid": [
            "class1",
            "class2"
          ]
        }
      ]
    }
  ]
}

I get:

{
  "files": [
    {
      "methods": {
        "invalid": [
          "func2",
          "func8"
        ],
        "valid": [
          "func3",
          "func1",
          "func4"
        ]
      },
      "classes": [
        {
          "invalid": [
            "class5",
            "class3"
          ],
          "valid": [
            "class2",
            "class1"
          ],
          "name": "class1"
        }
      ],
      "name": "some_file1"
    }
  ]
}

But that's wrong because, for example, class5 should be valid. So my questions are:

I would love to have another set of eyes check my code and help me find the reason for this issue.
Those two methods got so complicated that they are hard to debug. I would love to see an alternative, less complicated way to achieve this. Maybe some generic solution?

Edit: My first explanation was too complicated. I'll try to explain what I'm trying to achieve. For those of you who read the whole topic (I appreciate it!), please forget about data type A (for simplicity). Consider data type B (shown at the start). I'm trying to merge a bunch of B files. As I understand it, the algorithm is:

Iterate over the files.
Check whether the file is already in the merged dictionary.
- If not, add the file block to the files array.
- If so:
  - Merge the methods dictionary.
  - Merge the classes array.

To merge methods: a method is invalid only if it is invalid in both blocks. Otherwise, it's valid.

To merge classes: this is more complicated because it's an array. I want to follow the same rule as for methods, but first I need to find the index of each block in the array.

The main problem is with merging classes. Can you please suggest an uncomplicated way to merge B type files?

Answer 1

It would be great if you could provide the expected output for the example you're showing. Based on my understanding, what you're trying to achieve is:

You're given multiple JSON files, each of which contains a "files" entry, which is a list of dictionaries with the structure:

{
  "name": "file_name",
  "methods": {
    "invalid": ["list", "of", "names"],
    "valid": ["list", "of", "names"]
  },
  "classes": [
    {
      "name": "class_name",
      "invalid": ["list", "of", "names"],
      "valid": ["list", "of", "names"]
    }
  ]
}

You wish to merge structures from multiple files, so that file entries with the same "name" are merged together, according to the following rules:
1. For each name inside "methods": it goes into "valid" if it is in the "valid" array in at least one file entry; otherwise it goes into "invalid".
2. Classes with the same "name" are also merged together, and names inside their "valid" and "invalid" arrays are merged according to the above rule.
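
The rule as stated above can be sketched with plain set operations. This is only a minimal illustration under that reading of the rule, not the asker's code; the helper name `merge_valid_invalid` is hypothetical.

```python
def merge_valid_invalid(first, second):
    """Merge two {"valid": [...], "invalid": [...]} blocks.

    A name is valid if at least one block lists it as valid;
    it stays invalid only if no block lists it as valid.
    """
    valid = set(first["valid"]) | set(second["valid"])
    invalid = (set(first["invalid"]) | set(second["invalid"])) - valid
    # Sorting is only for reproducible output; sets have no defined order.
    return {"valid": sorted(valid), "invalid": sorted(invalid)}


# The two "class1" blocks from the example in the question.
first = {"valid": ["class1", "class2"], "invalid": ["class3", "class5"]}
second = {"valid": ["class8", "class5"], "invalid": ["class1", "class2"]}
merged = merge_valid_invalid(first, second)
print(merged)
```

With the example data, class5 ends up in "valid" and only class3 remains in "invalid".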

The following analysis of your code assumes my understanding as stated above. Let's look at the code snippet for merging classes:

block_to_merge = self.B_report["files"][filename_index]
merged_block = {'methods': {}, 'classes': []}
for current_class in block_to_merge["classes"]:
    current_classname = current_class.get("name")
    class_index = next((index for (index, d) in enumerate(merged_block["classes"]) if d["name"] == current_classname), None)
    if class_index == None:
        merged_block['classes'] += ([{'valid': c['valid'],  'invalid': c['invalid'], 'name': c['name'] } for c in current_file['classes']])
    else:
        class_block_to_merge = merged_block["classes"][class_index]
        class_merged_block = {'valid': [], 'invalid': []}
        class_merged_block['valid'] = list(set(class_block_to_merge['valid'] + current_class['valid']))
        class_merged_block['invalid'] = list({i for l in [class_block_to_merge['invalid'], current_class['invalid']]
                                                for i in l if i not in class_merged_block['valid']})
        class_merged_block['name'] = current_classname
        merged_block["classes"][filename_index] = class_merged_block

The code is logically incorrect because:

You're iterating over each class dictionary from block_to_merge["classes"], which is the previously merged block.
The new merged block (merged_block) is initialized with an empty "classes" list.
In the case where class_index is None, class dictionaries are copied into merged_block["classes"] as they are, without being merged with anything.

If you think about it, class_index will be None, because the lookup runs against merged_block["classes"], which starts out empty, while current_class is enumerated from block_to_merge["classes"], which is already merged. Thus, what gets written into merged_block is only an unmerged copy of the "classes" entries of a single file entry. In your example, you can verify that the "classes" entry in the output is exactly the same as that in the first file.
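
The lookup behaviour described above can be reproduced in isolation. The following is a simplified sketch, assuming a single class as in the example; the variable names `merged_classes` and `previous_classes` are made up for illustration and stand in for merged_block["classes"] and block_to_merge["classes"].

```python
# Stands in for merged_block["classes"], which is freshly created and empty.
merged_classes = []
# Stands in for block_to_merge["classes"], the previously merged block.
previous_classes = [
    {"name": "class1", "valid": ["class1", "class2"], "invalid": ["class3", "class5"]},
]

lookups = []
for current_class in previous_classes:
    # Same lookup pattern as in the question: search the (empty) new list.
    class_index = next(
        (i for i, d in enumerate(merged_classes) if d["name"] == current_class["name"]),
        None,
    )
    lookups.append(class_index)

print(lookups)  # [None]
```

Because the search target is empty, the generator yields nothing and `next` falls back to its default, so the merging branch is never reached for this class.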

That said, your overall idea of how to merge the files is correct, but implementation-wise it could be a lot simpler (and more efficient). I'll first point out the non-optimal parts of your code, and then provide a simpler solution.

You're storing the data directly in its output form; however, that form is not efficient for your task. It's perfectly fine to store the data in a form that is efficient, and then apply post-processing to transform it into the output form. For instance:
- You're using next to find an existing entry in the list with the same "name", but this takes linear time. Instead, you can store the entries in a dictionary, with "name" as the key.
- You're also storing valid and invalid names as lists. While merging, each list is converted into a set and then back into a list. This results in a large number of redundant copies. Instead, you can just store them as sets.
You have some duplicate routines that could have been extracted into functions, but instead you rewrote them wherever needed. This violates the DRY principle and increases your chances of introducing bugs.
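
To make the first two points concrete, here is a small sketch contrasting the linear `next` lookup with a dictionary keyed by "name" whose values hold sets. The sample data and the variable name `by_name` are illustrative assumptions, not part of the original code.

```python
files = [
    {"name": "abc", "valid": ["func1"], "invalid": ["func2"]},
    {"name": "xyz", "valid": [], "invalid": ["func9"]},
]

# Linear scan, as in the question: every lookup walks the list.
index = next((i for i, d in enumerate(files) if d["name"] == "xyz"), None)

# Dictionary keyed by name: constant-time lookup on average,
# with the names already stored as sets for merging.
by_name = {
    d["name"]: {"valid": set(d["valid"]), "invalid": set(d["invalid"])}
    for d in files
}
entry = by_name.get("xyz")  # None if the name is not present

print(index)
print(entry)
```

The dictionary form also removes the need to track list indices at all, since an entry can be updated in place through its key.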

A revised version of the code is as follows:

class Merger:
    def __init__(self):
        # A structure optimized for efficiency:
        # dict (file_name) -> {
        #   "methods": {
        #     "valid": set(names),
        #     "invalid": set(names),
        #   }
        #   "classes": dict (class_name) -> {
        #     "valid": set(names),
        #     "invalid": set(names),
        #   }
        # }
        self.file_dict = {}

    def _create_entry(self, new_entry):
        return {
            "valid": set(new_entry["valid"]),
            "invalid": set(new_entry["invalid"]),
        }

    def _merge_entry(self, merged_entry, new_entry):
        merged_entry["valid"].update(new_entry["valid"])
        merged_entry["invalid"].difference_update(new_entry["valid"])
        for name in new_entry["invalid"]:
            if name not in merged_entry["valid"]:
                merged_entry["invalid"].add(name)

    def merge_file(self, src_report):
        # Method called to merge one file.
        for current_file in src_report["files"]:
            file_name = current_file["name"]
            # Merge methods.
            if file_name not in self.file_dict:
                self.file_dict[file_name] = {
                    "methods": self._create_entry(current_file["methods"]),
                    "classes": {},
                }
            else:
                self._merge_entry(self.file_dict[file_name]["methods"], current_file["methods"])
            # Merge classes.
            file_class_entry = self.file_dict[file_name]["classes"]
            for class_entry in current_file["classes"]:
                class_name = class_entry["name"]
                if class_name not in file_class_entry:
                    file_class_entry[class_name] = self._create_entry(class_entry)
                else:
                    self._merge_entry(file_class_entry[class_name], class_entry)

    def post_process(self):
        # Method called after all files are merged, and returns the data in its output form.
        return [
            {
                "name": file_name,
                "methods": {
                    "valid": list(file_entry["methods"]["valid"]),
                    "invalid": list(file_entry["methods"]["invalid"]),
                },
                "classes": [
                    {
                        "name": class_name,
                        "valid": list(class_entry["valid"]),
                        "invalid": list(class_entry["invalid"]),
                    }
                    for class_name, class_entry in file_entry["classes"].items()
                ],
            }
            for file_name, file_entry in self.file_dict.items()
        ]

We can test the implementation with:

def main():
    a = {
      "files": [
        {
          "name": "some_file1",
          "methods": {
            "valid": [
              "func4",
              "func1"
            ],
            "invalid": [
              "func3"
            ]
          },
          "classes": [
            {
              "name": "class1",
              "valid": [
                "class1",
                "class2"
              ],
              "invalid": [
                "class3",
                "class5"
              ]
            }
          ]
        }
      ]
    }
    b = {
      "files": [
        {
          "name": "some_file1",
          "methods": {
            "valid": [
              "func4",
              "func1",
              "func3"
            ],
            "invalid": [
              "func2",
              "func8"
            ]
          },
          "classes": [
            {
              "name": "class1",
              "valid": [
                "class8",
                "class5"
              ],
              "invalid": [
                "class1",
                "class2"
              ]
            }
          ]
        }
      ]
    }
    import pprint
    merge = Merger()
    merge.merge_file(a)
    merge.merge_file(b)
    output = merge.post_process()
    pprint.pprint(output)


if __name__ == '__main__':
    main()

The output is:

[{'classes': [{'invalid': ['class3'],
               'name': 'class1',
               'valid': ['class2', 'class5', 'class8', 'class1']}],
  'methods': {'invalid': ['func2', 'func8'],
              'valid': ['func1', 'func4', 'func3']},
  'name': 'some_file1'}]
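
Note that the result above is a bare list, and the order of names inside each list depends on set iteration order, so it may differ between runs. If a reproducible report shaped like the input files (with a top-level "files" key) is wanted, something along these lines could be used. This is a sketch only: `to_report` is a hypothetical helper, and the sample `file_dict` is written by hand to mimic the structure of `Merger.file_dict`.

```python
def to_report(file_dict):
    """Convert a dict-of-sets structure into a sorted, input-like report."""
    return {
        "files": [
            {
                "name": file_name,
                "methods": {
                    "valid": sorted(entry["methods"]["valid"]),
                    "invalid": sorted(entry["methods"]["invalid"]),
                },
                "classes": [
                    {
                        "name": class_name,
                        "valid": sorted(cls["valid"]),
                        "invalid": sorted(cls["invalid"]),
                    }
                    # Keys are unique, so sorting the items orders by class name.
                    for class_name, cls in sorted(entry["classes"].items())
                ],
            }
            for file_name, entry in sorted(file_dict.items())
        ]
    }


# Hand-written stand-in for the merged state of the example above.
file_dict = {
    "some_file1": {
        "methods": {"valid": {"func1", "func3", "func4"}, "invalid": {"func2", "func8"}},
        "classes": {
            "class1": {"valid": {"class1", "class2", "class5", "class8"}, "invalid": {"class3"}},
        },
    }
}
report = to_report(file_dict)
print(report)
```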
