Find the intersection of values in a dictionary under the same key - python
I have a file with lines. Each line is split on "|". I want to compare argument 5 from each line and, if the values intersect, proceed. This leads to a second part:
How do I compare the intersection of values under the same key? The code below works across keys but not within the same key:
from functools import reduce
reduce(set.intersection, (set(val) for val in query_dict.values()))
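To make the "within the same key" case concrete, here is a minimal sketch. It assumes each key maps to several value lists (one per record); the layout of `query_dict` and the helper name `intersect_within_keys` are hypothetical, since the question does not show the real dictionary.

```python
from functools import reduce

# Hypothetical layout: each key maps to a list of value lists (one per record).
query_dict = {
    "Angela Darvill": [["GB", "US"], ["US"]],
    "Helen Stanley": [["US"], ["US"]],
}

def intersect_within_keys(mapping):
    """For each key, return the values common to all of that key's lists."""
    return {
        key: reduce(set.intersection, (set(values) for values in value_lists))
        for key, value_lists in mapping.items()
    }

common = intersect_within_keys(query_dict)
print(common)
```

Note that `reduce` raises a `TypeError` on an empty sequence, so this sketch assumes every key has at least one value list.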
Here is an example of lines:
text1|text2|text3|text4 text 5| text 6| text 7 text 8|
text1|text2|text12|text4 text 5| text 6| text 7|
text9|text10|text3|text4 text 5| text 11| text 12 text 8|
The output should be:
text1|text2|text3;text12|text4;text5;text4;text5|text6;text7 text8;text6
In other words, lines are concatenated only if their 1st and 2nd arguments match (the cells are equal) and their 5th and 6th arguments overlap (intersect).
Here is the input file:
Angela Darvill|19036321|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['GB','US']|['Salford', 'Eccles', 'Manchester']
Helen Stanley|19036320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
Angela Darvill|190323121|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['US']|['Brighton', 'Eccles', 'Manchester']
Helen Stanley|19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
The output should look like:
Angela Darvill|19036321;190323121|...
Helen Stanley|19036320;19576876320|...
Angela Darvill gets stacked because the two records share the same name, the same country and the same city (or cities).
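The stacking rule above can be sketched as a small predicate. This is only an illustration under assumptions: the helper names `parse_record` and `records_overlap` are hypothetical, "share" is read as "the sets have at least one element in common", and the addresses are shortened for brevity. The list-like columns are parsed with `ast.literal_eval` so that real list items are compared.

```python
import ast

def parse_record(line):
    """Split one '|'-separated line and parse the two list-like columns into sets."""
    name, record_id, address, countries, cities = line.split("|")
    return (
        name,
        record_id,
        address,
        set(ast.literal_eval(countries)),
        set(ast.literal_eval(cities)),
    )

def records_overlap(a, b):
    """True if both records have the same name and share a country and a city."""
    return a[0] == b[0] and bool(a[3] & b[3]) and bool(a[4] & b[4])

# Addresses shortened for brevity; the other columns come from the input file.
angela_1 = parse_record("Angela Darvill|19036321|Salford, UK.|['GB','US']|['Salford', 'Eccles', 'Manchester']")
angela_2 = parse_record("Angela Darvill|190323121|Salford, UK.|['US']|['Brighton', 'Eccles', 'Manchester']")
helen = parse_record("Helen Stanley|19036320|Brighton, UK.|['US']|['Brighton', 'Brighton']")

print(records_overlap(angela_1, angela_2))
print(records_overlap(angela_1, helen))
```

Parsing first matters: calling `set()` directly on the raw string `"['US']"` would give a set of characters, not a set of country codes.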
from itertools import zip_longest

data = """\
text1|text2|text3|text4 text 5| text 6| text 7 text 8|
text1|text2|text12|text4 text 5| text 6| text 7| text9|text10|text3|text4 text 5| text 11| text 12 text 8|
"""

lines = tuple(line.split('|') for line in data.splitlines())
number_of_lines = len(lines)
print(f"number of lines : {number_of_lines}")
print(f"number of cells in line 1 : {len(lines[0])}")
print(f"number of cells in line 2 : {len(lines[1])}")
print(f"{lines[0]=}")
print(f"{lines[1]=}")

result = []
# we want to compare each line with each other, so the inner loop starts at index+1
for line_a_index, line_a in enumerate(lines):
    for line_b_index, line_b in enumerate(lines[line_a_index+1:], start=line_a_index+1):
        # cells 5 and 6 are indexed below, so each line needs at least 7 cells
        assert len(line_a) >= 7, f"not enough cells ({len(line_a)}) in line {line_a_index}"
        assert len(line_b) >= 7, f"not enough cells ({len(line_b)}) in line {line_b_index}"
        assert all(isinstance(cell, str) for cell in line_a)
        assert all(isinstance(cell, str) for cell in line_b)
        # note: `in` on strings is a substring test, and an empty cell is "in" any cell
        if line_a[0] == line_b[0] and line_a[1] == line_b[1] and (
            line_a[5] in line_b[5] or line_a[6] in line_b[6]  # A in B
            or line_b[5] in line_a[5] or line_b[6] in line_a[6]  # B in A
        ):
            result.append(tuple(
                ((cell_a or "") + (";" if (cell_a or cell_b) else "") + (cell_b or "")) if cell_a != cell_b else cell_a
                for cell_a, cell_b in zip_longest(line_a[:5+1], line_b[:5+1])  # <-- here I truncated the lines
            ))

# I decided to have a fancy output, but I made some simplifying assumptions to make it simple
if len(result) > 1:
    raise NotImplementedError
widths = tuple(max(len(a) if a is not None else 0, len(b) if b is not None else 0, len(c) if c is not None else 0)
               for a, b, c in zip_longest(lines[0], lines[1], result[0]))
length = max(len(lines[0]), len(lines[1]), len(result[0]))
for line in (lines[0], lines[1], result[0]):
    for index, cell in zip_longest(range(length), line):
        if cell:
            print(cell.ljust(widths[index]), end='|')
    print("", end='\n')  # explicit newline

original_expected_output = "text1|text2|text3;text12|text4;text5;text4;text5|text6;text7 text8;text6"
print(f"{original_expected_output} <-- expected")
lenormju_expected_output = "text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7"
print(f"{lenormju_expected_output} <-- fixed")
Output:
number of lines : 2
number of cells in line 1 : 7
number of cells in line 2 : 13
lines[0]=['text1', 'text2', 'text3', 'text4 text 5', ' text 6', ' text 7 text 8', '']
lines[1]=['text1', 'text2', 'text12', 'text4 text 5', ' text 6', ' text 7', ' text9', 'text10', 'text3', 'text4 text 5', ' text 11', ' text 12 text 8', '']
text1|text2|text3       |text4 text 5| text 6| text 7 text 8        |
text1|text2|text12      |text4 text 5| text 6| text 7               | text9|text10|text3|text4 text 5| text 11| text 12 text 8|
text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7|
text1|text2|text3;text12|text4;text5;text4;text5|text6;text7 text8;text6 <-- expected
text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7 <-- fixed
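The cell-merging step used above can be isolated into a small helper, which may make the rule easier to test on its own. This is a sketch with a hypothetical name (`merge_cells`) and one deliberate variation: when one of the two cells is empty or missing, it keeps only the non-empty cell instead of emitting a dangling ";".

```python
from itertools import zip_longest

def merge_cells(line_a, line_b, sep=";"):
    """Merge two lists of cells column by column.

    Equal cells are kept once; different cells are joined with `sep`.
    Missing or empty cells are skipped (a variation on the code above).
    """
    merged = []
    for cell_a, cell_b in zip_longest(line_a, line_b, fillvalue=""):
        if cell_a == cell_b:
            merged.append(cell_a)
        else:
            merged.append(sep.join(cell for cell in (cell_a, cell_b) if cell))
    return merged

row = merge_cells(
    ["text1", "text2", "text3", "text4 text 5"],
    ["text1", "text2", "text12", "text4 text 5"],
)
print("|".join(row))
```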
Based on your improved question:
import itertools

data = """\
Angela Darvill|19036321|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['GB','US']|['Salford', 'Eccles', 'Manchester']
Helen Stanley|19036320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
Angela Darvill|190323121|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['US']|['Brighton', 'Eccles', 'Manchester']
Helen Stanley|19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
"""

lines = tuple(tuple(line.split('|')) for line in data.splitlines())
results = []
for line_a_index, line_a in enumerate(lines):
    # we want to compare each line with each other, so we start at index+1
    for line_b_index, line_b in enumerate(lines[line_a_index+1:], start=line_a_index+1):
        assert len(line_a) >= 5, f"not enough cells ({len(line_a)}) in line {line_a_index}"
        assert len(line_b) >= 5, f"not enough cells ({len(line_b)}) in line {line_b_index}"
        assert all(isinstance(cell, str) for cell in line_a)
        assert all(isinstance(cell, str) for cell in line_b)
        columns0_are_equal = line_a[0] == line_b[0]
        columns1_are_equal = line_a[1] == line_b[1]
        # note: the cells are strings, so set(cell) is a set of characters, not of list items
        columns3_are_overlap = set(line_a[3]).issubset(set(line_b[3])) or set(line_b[3]).issubset(set(line_a[3]))
        columns4_are_overlap = set(line_a[4]).issubset(set(line_b[4])) or set(line_b[4]).issubset(set(line_a[4]))
        print(f"between lines index={line_a_index} and index={line_b_index}, {columns0_are_equal=} {columns1_are_equal=} {columns3_are_overlap=} {columns4_are_overlap=}")
        if (
            columns0_are_equal
            # and columns1_are_equal
            and (columns3_are_overlap or columns4_are_overlap)
        ):
            print("MATCH!")
            results.append(
                (line_a_index, line_b_index,) + tuple(
                    ((cell_a or "") + (";" if (cell_a or cell_b) else "") + (cell_b or "")) if cell_a != cell_b
                    else cell_a
                    for cell_a, cell_b in itertools.zip_longest(line_a, line_b)
                )
            )

print("Fancy output :")
lines_to_display = set(itertools.chain.from_iterable((lines[result[0]], lines[result[1]], result[2:]) for result in results))
columns_widths = (max(len(str(index)) for result in results for index in (result[0], result[1])),) + tuple(
    max(len(cell) for cell in column)
    for column in zip(*lines_to_display)
)
for width in columns_widths:
    print("-" * width, end="|")
print("")
for result in results:
    # first the two original lines, each prefixed with its index
    for line_index, original_line in zip((result[0], result[1]), (lines[result[0]], lines[result[1]])):
        for column_index, cell in zip(itertools.count(), (str(line_index),) + original_line):
            if cell:
                print(cell.ljust(columns_widths[column_index]), end='|')
        print("", end='\n')  # explicit newline
    # then the merged line, prefixed with "="
    for column_index, cell in zip(itertools.count(), ("=",) + result[2:]):
        if cell:
            print(cell.ljust(columns_widths[column_index]), end='|')
    print("", end='\n')  # explicit newline
for width in columns_widths:
    print("-" * width, end="|")
print("")

expected_outputs = """\
Angela Darvill|19036321;190323121|...
Helen Stanley|19036320;19576876320|...
""".splitlines()
for result, expected_output in itertools.zip_longest(results, expected_outputs):
    actual_output = "|".join(result[2:])
    assert actual_output.startswith(expected_output[:-3])  # minus the "..."
-|--------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------|------------------|------------------------------------------------------------------------|
0|Angela Darvill|19036321            |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.                                                   |['GB','US']       |['Salford', 'Eccles', 'Manchester']                                     |
2|Angela Darvill|190323121           |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.                                                   |['US']            |['Brighton', 'Eccles', 'Manchester']                                    |
=|Angela Darvill|19036321;190323121  |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.                                                   |['GB','US'];['US']|['Salford', 'Eccles', 'Manchester'];['Brighton', 'Eccles', 'Manchester']|
1|Helen Stanley |19036320            |Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']            |['Brighton', 'Brighton']                                                |
3|Helen Stanley |19576876320         |Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']            |['Brighton', 'Brighton']                                                |
=|Helen Stanley |19036320;19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']            |['Brighton', 'Brighton']                                                |
-|--------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------|------------------|------------------------------------------------------------------------|
You can see that the lines at index 0 and 2 have been merged, and likewise the lines at index 1 and 3.
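If more than two records can share a name, a pairwise comparison produces one merged line per pair. A possible alternative, sketched here under assumptions, is to group records by name and grow clusters whose country and city sets intersect. The function name `merge_records` and the output layout (sorted, comma-joined sets) are hypothetical choices, and the addresses are shortened for brevity.

```python
import ast
from collections import defaultdict

def merge_records(lines):
    """Group '|'-separated records by name and stack the ids of overlapping records."""
    groups = defaultdict(list)  # name -> list of clusters
    for line in lines:
        name, record_id, address, countries, cities = line.split("|")
        countries = set(ast.literal_eval(countries))
        cities = set(ast.literal_eval(cities))
        for cluster in groups[name]:
            # overlap means: at least one shared country and one shared city
            if cluster["countries"] & countries and cluster["cities"] & cities:
                cluster["ids"].append(record_id)
                cluster["countries"] |= countries
                cluster["cities"] |= cities
                break
        else:
            # no overlapping cluster found: start a new one
            groups[name].append(
                {"ids": [record_id], "address": address,
                 "countries": countries, "cities": cities}
            )
    return [
        "|".join([
            name,
            ";".join(cluster["ids"]),
            cluster["address"],
            ",".join(sorted(cluster["countries"])),
            ",".join(sorted(cluster["cities"])),
        ])
        for name, clusters in groups.items()
        for cluster in clusters
    ]

# Addresses shortened for brevity; the other columns come from the input file.
sample = [
    "Angela Darvill|19036321|Salford, UK.|['GB','US']|['Salford', 'Eccles', 'Manchester']",
    "Helen Stanley|19036320|Brighton, UK.|['US']|['Brighton', 'Brighton']",
    "Angela Darvill|190323121|Salford, UK.|['US']|['Brighton', 'Eccles', 'Manchester']",
    "Helen Stanley|19576876320|Brighton, UK.|['US']|['Brighton', 'Brighton']",
]
merged = merge_records(sample)
for merged_line in merged:
    print(merged_line)
```

This is a single greedy pass: a later record that overlaps two existing clusters joins only the first one, so a full transitive merge would need an extra step such as union-find.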