Find the intersection of values in a dictionary under the same key - python

Question

I have a file whose lines are each split on "|". I want to compare argument 5 (the fifth field) of each line and, if the values intersect, proceed. This leads to a second part:

First, arguments 1 and 2 are compared via a dictionary, and if they are the same AND
arguments 5 and 6 overlap, those lines get concatenated.

How can I compare the intersection of values under the same key? The code below works across keys but not within the same key:

from functools import reduce 
reduce(set.intersection, (set(val) for val in query_dict.values()))

In other words, only lines that match on the 1st and 2nd arguments (cells equal) and whose 5th and 6th arguments overlap (intersect) are concatenated.
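
One way to read "intersection under the same key" is to keep the values grouped by key and intersect the sets stored under each key separately. A minimal sketch, assuming a hypothetical `query_dict` that maps each name to a list of city lists (that layout is an assumption, not something stated in the question):

```python
from functools import reduce

# Hypothetical layout: key -> list of value lists collected for that key.
query_dict = {
    "Angela Darvill": [
        ["Salford", "Eccles", "Manchester"],
        ["Brighton", "Eccles", "Manchester"],
    ],
    "Helen Stanley": [
        ["Brighton", "Brighton"],
        ["Brighton", "Brighton"],
    ],
}

# Intersect the value lists stored under each key, one key at a time.
# Each key is assumed to hold at least one list (reduce fails on an empty one).
common_per_key = {
    key: reduce(set.intersection, (set(values) for values in value_lists))
    for key, value_lists in query_dict.items()
}

for key, common in common_per_key.items():
    print(key, sorted(common))
```

A non-empty set for a key means that all records under that key share at least one value.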

Here is the input file:

Angela Darvill|19036321|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['GB','US']|['Salford', 'Eccles', 'Manchester']
Helen Stanley|19036320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
Angela Darvill|190323121|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['US']|['Brighton', 'Eccles', 'Manchester']
Helen Stanley|19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']

The output should look like:

Angela Darvill|19036321;190323121|...
Helen Stanley|19036320;19576876320|...

Angela Darvill gets stacked because the two records share the same name, the same country, and the same city (or cities).

Answer 1

from itertools import zip_longest


data = """\
text1|text2|text3|text4 text 5| text 6| text 7 text 8|
text1|text2|text12|text4 text 5| text 6| text 7| text9|text10|text3|text4 text 5| text 11| text 12 text 8|
"""

lines = tuple(line.split('|') for line in data.splitlines())
number_of_lines = len(lines)
print(f"number of lines : {number_of_lines}")
print(f"number of cells in line 1 : {len(lines[0])}")
print(f"number of cells in line 2 : {len(lines[1])}")
print(f"{lines[0]=}")
print(f"{lines[1]=}")

result = []

# we want to compare each line with each other :
for line_a_index, line_a in enumerate(lines):
    for line_b_index, line_b in enumerate(lines[line_a_index+1:], start=line_a_index+1):
        # cells 5 and 6 are read below, so each line needs at least 7 cells
        assert len(line_a) >= 7, f"not enough cells ({len(line_a)}) in line {line_a_index}"
        assert len(line_b) >= 7, f"not enough cells ({len(line_b)}) in line {line_b_index}"
        assert all(isinstance(cell, str) for cell in line_a)
        assert all(isinstance(cell, str) for cell in line_b)

        if line_a[0] == line_b[0] and line_a[1] == line_b[1] and (
                line_a[5] in line_b[5] or line_a[6] in line_b[6]  # A in B
            or line_b[5] in line_a[5] or line_b[6] in line_a[6]  # B in A
        ):
            result.append(tuple(
                ((cell_a or "") + (";" if (cell_a or cell_b) else "") + (cell_b or "")) if cell_a != cell_b else cell_a
                for cell_a, cell_b in zip_longest(line_a[:5+1], line_b[:5+1])  # <-- here I truncated the lines
            ))

# I decided to have a fancy output, but I made some simplifying assumptions to make it simple
if len(result) > 1:
    raise NotImplementedError
widths = tuple(max(len(a) if a is not None else 0, len(b) if b is not None else 0, len(c) if c is not None else 0)
               for a, b, c in zip_longest(lines[0], lines[1], result[0]))
length = max(len(lines[0]), len(lines[1]), len(result[0]))
for line in (lines[0], lines[1], result[0]):
    for index, cell in zip_longest(range(length), line):
        if cell:
            print(cell.ljust(widths[index]), end='|')
    print("", end='\n')  # explicit newline

original_expected_output = "text1|text2|text3;text12|text4;text5;text4;text5|text6;text7 text8;text6"
print(f"{original_expected_output}         <-- expected")

lenormju_expected_output = "text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7"
print(f"{lenormju_expected_output}             <-- fixed")

Output:

number of lines : 2
number of cells in line 1 : 7
number of cells in line 2 : 13
lines[0]=['text1', 'text2', 'text3', 'text4 text 5', ' text 6', ' text 7 text 8', '']
lines[1]=['text1', 'text2', 'text12', 'text4 text 5', ' text 6', ' text 7', ' text9', 'text10', 'text3', 'text4 text 5', ' text 11', ' text 12 text 8', '']
text1|text2|text3       |text4 text 5| text 6| text 7 text 8        |
text1|text2|text12      |text4 text 5| text 6| text 7               | text9|text10|text3|text4 text 5| text 11| text 12 text 8|
text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7|
text1|text2|text3;text12|text4;text5;text4;text5|text6;text7 text8;text6         <-- expected
text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7             <-- fixed
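
Note that `in` between two strings is a substring test, and the empty string is a substring of every string. In the sample above the last cell of the first line is empty, so `line_a[6] in line_b[6]` is true regardless of what `line_b[6]` contains. A small sketch of that behaviour, using cells copied from the output above; the word-set comparison at the end is only one possible alternative, not part of the answer:

```python
# Cells taken from lines[0] and lines[1] in the output above.
line_a_5, line_a_6 = " text 7 text 8", ""
line_b_5, line_b_6 = " text 7", " text9"

print(line_a_5 in line_b_5)  # False: the longer cell is not inside the shorter one
print(line_b_5 in line_a_5)  # True: " text 7" is a substring of " text 7 text 8"
print(line_a_6 in line_b_6)  # True: "" is a substring of any string

# Comparing whitespace-separated words as sets avoids the empty-string match.
words_overlap = bool(set(line_a_6.split()) & set(line_b_6.split()))
print(words_overlap)  # False
```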

Answer 2

Based on your improved question:

import itertools


data = """\
Angela Darvill|19036321|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['GB','US']|['Salford', 'Eccles', 'Manchester']
Helen Stanley|19036320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
Angela Darvill|190323121|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['US']|['Brighton', 'Eccles', 'Manchester']
Helen Stanley|19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
"""

lines = tuple(tuple(line.split('|')) for line in data.splitlines())

results = []
for line_a_index, line_a in enumerate(lines):
    # we want to compare each line with each other, so we start at index+1
    for line_b_index, line_b in enumerate(lines[line_a_index+1:], start=line_a_index+1):
        assert len(line_a) >= 5, f"not enough cells ({len(line_a)}) in line {line_a_index}"
        assert len(line_b) >= 5, f"not enough cells ({len(line_b)}) in line {line_b_index}"
        assert all(isinstance(cell, str) for cell in line_a)
        assert all(isinstance(cell, str) for cell in line_b)

        columns0_are_equal = line_a[0] == line_b[0]
        columns1_are_equal = line_a[1] == line_b[1]
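        # columns 3 and 4 hold the text of a list, so split them into items first;
        # calling set() on the raw string would compare individual characters
        items_a3, items_b3, items_a4, items_b4 = (
            {item.strip(" '") for item in cell.strip("[]").split(",")}
            for cell in (line_a[3], line_b[3], line_a[4], line_b[4])
        )
        columns3_are_overlap = bool(items_a3 & items_b3)
        columns4_are_overlap = bool(items_a4 & items_b4)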
        print(f"between lines index={line_a_index} and index={line_b_index}, {columns0_are_equal=} {columns1_are_equal=} {columns3_are_overlap=} {columns4_are_overlap=}")
        if (
            columns0_are_equal
            # and columns1_are_equal
            and (columns3_are_overlap or columns4_are_overlap)
        ):
            print("MATCH!")
            results.append(
                (line_a_index, line_b_index,) + tuple(
                    ((cell_a or "") + (";" if (cell_a or cell_b) else "") + (cell_b or "")) if cell_a != cell_b
                    else cell_a
                    for cell_a, cell_b in itertools.zip_longest(line_a, line_b)
                )
            )

print("Fancy output :")
lines_to_display = set(itertools.chain.from_iterable((lines[result[0]], lines[result[1]], result[2:]) for result in results))
columns_widths = (max(len(str(index)) for result in results for index in (result[0], result[1])),) + tuple(
    max(len(cell) for cell in column)
    for column in zip(*lines_to_display)
)

for width in columns_widths:
    print("-" * width, end="|")
print("")

for result in results:
    for line_index, original_line in zip((result[0], result[1]), (lines[result[0]], lines[result[1]])):
        for column_index, cell in zip(itertools.count(), (str(line_index),) + original_line):
            if cell:
                print(cell.ljust(columns_widths[column_index]), end='|')
        print("", end='\n')  # explicit newline
    for column_index, cell in zip(itertools.count(), ("=",) + result[2:]):
        if cell:
            print(cell.ljust(columns_widths[column_index]), end='|')
    print("", end='\n')  # explicit newline

for width in columns_widths:
    print("-" * width, end="|")
print("")

expected_outputs = """\
Angela Darvill|19036321;190323121|...
Helen Stanley|19036320;19576876320|...
""".splitlines()

for result, expected_output in itertools.zip_longest(results, expected_outputs):
    actual_output = "|".join(result[2:])
    assert actual_output.startswith(expected_output[:-3])  # minus the "..."

Output (table part only):

-|--------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------|------------------|------------------------------------------------------------------------|
0|Angela Darvill|19036321            |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.                                                   |['GB','US']       |['Salford', 'Eccles', 'Manchester']                                     |
2|Angela Darvill|190323121           |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.                                                   |['US']            |['Brighton', 'Eccles', 'Manchester']                                    |
=|Angela Darvill|19036321;190323121  |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.                                                   |['GB','US'];['US']|['Salford', 'Eccles', 'Manchester'];['Brighton', 'Eccles', 'Manchester']|
1|Helen Stanley |19036320            |Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']            |['Brighton', 'Brighton']                                                |
3|Helen Stanley |19576876320         |Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']            |['Brighton', 'Brighton']                                                |
=|Helen Stanley |19036320;19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']            |['Brighton', 'Brighton']                                                |
-|--------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------|------------------|------------------------------------------------------------------------|

You can see that the lines at index 0 and 2 have been merged, and likewise the lines at index 1 and 3.
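
For the output format asked for in the question (the name followed by the joined IDs), a shorter variant is possible. This is only a sketch under assumptions: the two list-like columns are parsed with `ast.literal_eval`, "overlap" is taken to mean a non-empty set intersection of countries or of cities, each record is merged into the first earlier group with the same name that it overlaps, and the affiliation column is shortened to a placeholder. The names `merge_records` and `groups` are illustrative.

```python
import ast

# Same records as above, with the affiliation column shortened to a placeholder.
data = """\
Angela Darvill|19036321|Affiliation A|['GB','US']|['Salford', 'Eccles', 'Manchester']
Helen Stanley|19036320|Affiliation B|['US']|['Brighton', 'Brighton']
Angela Darvill|190323121|Affiliation A|['US']|['Brighton', 'Eccles', 'Manchester']
Helen Stanley|19576876320|Affiliation B|['US']|['Brighton', 'Brighton']
"""


def merge_records(text):
    """Merge records with the same name whose countries or cities overlap."""
    groups = []  # each group: [name, list of ids, set of countries, set of cities]
    for line in text.splitlines():
        name, record_id, _affiliation, countries, cities = line.split("|")
        # Turn the textual lists into real sets of items.
        countries = set(ast.literal_eval(countries))
        cities = set(ast.literal_eval(cities))
        for group in groups:
            if group[0] == name and (group[2] & countries or group[3] & cities):
                group[1].append(record_id)
                group[2] |= countries
                group[3] |= cities
                break
        else:
            # No earlier group matched, so this record starts a new one.
            groups.append([name, [record_id], countries, cities])
    return [f"{name}|{';'.join(ids)}" for name, ids, _countries, _cities in groups]


for merged in merge_records(data):
    print(merged)
```

With the sample records this prints `Angela Darvill|19036321;190323121` and `Helen Stanley|19036320;19576876320`. `ast.literal_eval` only evaluates Python literals, which makes it a reasonable fit for cells that look like `['GB','US']`.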

2 answers

Answer 1: 0 2022-01-28 14:14:34
Answer 2: 0 2022-02-02 17:25:44