How to find the shortest distance between two regex matches

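I have a set of documents in which I search for specific entities, and I need to find the shortest distance between two of them. Say I have a document in which I search for Trump and Ukraine, and I get a list of mentions together with their start and end positions: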

import re

text = """
 Three constitutional scholars invited by Democrats to testify at Wednesday’s impeachment hearings said that President Trump’s efforts to pressure Ukraine for political gain clearly meet the historical definition of impeachable offenses, according to copies of their opening statements.
 ˜Noah Feldman, a professor at Harvard, argued that attempts by Mr. Trump to withhold a White House meeting and military assistance from Ukraine as leverage for political favors constitute impeachable conduct, as does the act of soliciting foreign assistance on a phone call with Ukraine’s leader.
"""
p1 = re.compile("Trump")
p2 = re.compile("Ukraine")
res1 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p1.finditer(text)]
res2 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p2.finditer(text)]
print(res1)
print(res2)

Output:

[{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
[{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

In this particular case the answer is 148 - 125 = 23. How would you suggest doing this in the most Pythonic way?

One solution is to capture the text between the two matches and take its length:

min([len(x) for x in re.findall(r'Trump(.*?)Ukraine', text)])

This prints 23.
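A caveat worth checking before relying on the lazy pattern: it anchors on the first Trump it finds, so when two Trump mentions precede a Ukraine the captured span is longer than the true gap, and `.` stops at newlines unless `re.DOTALL` is passed. The sketch below illustrates this on a made-up string (`sample` is hypothetical, not from the question) and shows one possible workaround, a negative lookahead that refuses to run across another Trump. It still only covers Trump appearing before Ukraine.

```python
import re

# Hypothetical sample text: two "Trump" mentions before one "Ukraine".
sample = "Trump met aides. Trump called Ukraine."

# Lazy pattern: starts at the FIRST "Trump", so the capture spans the second one.
naive = [len(g) for g in re.findall(r'Trump(.*?)Ukraine', sample, flags=re.DOTALL)]

# Workaround (an assumption, not part of the original answer): forbid another
# "Trump" inside the capture, so the match starts at the nearest mention.
tempered = [
    len(g)
    for g in re.findall(r'Trump((?:(?!Trump).)*?)Ukraine', sample, flags=re.DOTALL)
]

print(naive, tempered)
```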

Using itertools.product:

min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0)

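Or, on Python 3.8+, you could use the walrus operator. The assignment expression has to be parenthesized, both because a bare := is not allowed in a comprehension's if clause and so that res holds the difference rather than the result of the comparison:

min(res for x, y in product(res2, res1) if (res := x['start'] - y['end']) > 0)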

Code:

from itertools import product

res1 = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
res2 =[{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

print(min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0))
# 23
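`product` compares every pair, which is fine for a handful of mentions. For long documents, a single pass over the mentions in text order may be enough. The following is a minimal sketch of that idea, not something taken from the answers here; the helper name `shortest_gap` is made up for illustration. It also handles the case where Ukraine comes before Trump, and treats overlapping mentions as distance 0.

```python
def shortest_gap(res1, res2):
    """Smallest gap between a mention in res1 and one in res2, in either order."""
    # Tag each mention with the list it came from, then walk them in text order.
    tagged = sorted(
        [(m['start'], m['end'], 0) for m in res1]
        + [(m['start'], m['end'], 1) for m in res2]
    )
    best = None
    last_end = {}  # source tag -> largest end position seen so far
    for start, end, src in tagged:
        other = 1 - src
        if other in last_end:
            # Gap to the closest earlier mention of the other entity.
            gap = max(0, start - last_end[other])
            if best is None or gap < best:
                best = gap
        last_end[src] = max(last_end.get(src, end), end)
    return best  # None if either list is empty


res1 = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
res2 = [{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

print(shortest_gap(res1, res2))
```

Sorting dominates the cost, so this runs in O((n + m) log(n + m)) instead of O(n * m).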

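Don't forget to take the absolute value of the distance between the two points; otherwise the shortest distance can come out negative, which I assume is not what you want: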

# Named "mentions" rather than "dict" to avoid shadowing the built-in
mentions = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}, {'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

shortest = float('inf')
start = -1
end = -1

for i in range(len(mentions)):
    for j in range(len(mentions)):
        # Only compare mentions of different entities (this also skips i == j)
        if mentions[i]['name'] != mentions[j]['name']:
            dist = abs(mentions[i]['start'] - mentions[j]['end'])
            if dist < shortest:
                shortest = dist
                start = i
                end = j

print("Start: {}, end: {}, distance: {}\n".format(mentions[start]['name'], mentions[end]['name'], shortest))
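If the goal is to know which two mentions are closest, not just the distance, one option is to define a symmetric gap and pass it to `min` as a key. This is a hedged sketch: the `gap` helper is a name introduced here for illustration, and it measures the characters between the two spans regardless of which comes first, returning 0 when they overlap.

```python
from itertools import product

res1 = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
res2 = [{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]


def gap(a, b):
    """Characters between two mentions; 0 if they overlap or touch."""
    # Whichever mention comes first, exactly one of the two differences
    # is the real gap; the other is negative.
    return max(a['start'] - b['end'], b['start'] - a['end'], 0)


# Pick the (Trump, Ukraine) pair with the smallest gap.
closest = min(product(res1, res2), key=lambda pair: gap(*pair))
print(closest, gap(*closest))
```

Unlike `abs(start - end)`, this never measures from the start of an earlier mention to the end of a later one.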
