How to find the shortest distance between two regex matches
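I have a set of documents in which I can search for specific entities, and I need to find the shortest distance between two of them. Say I have a document in which I search for Trump and Ukraine, and I get a list of mentions along with their start and end positions: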
import re
text = """
Three constitutional scholars invited by Democrats to testify at Wednesday’s impeachment hearings said that President Trump’s efforts to pressure Ukraine for political gain clearly meet the historical definition of impeachable offenses, according to copies of their opening statements.
˜Noah Feldman, a professor at Harvard, argued that attempts by Mr. Trump to withhold a White House meeting and military assistance from Ukraine as leverage for political favors constitute impeachable conduct, as does the act of soliciting foreign assistance on a phone call with Ukraine’s leader.
"""
p1 = re.compile("Trump")
p2 = re.compile("Ukraine")
res1 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p1.finditer(text)]
res2 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p2.finditer(text)]
print(res1)
print(res2)
Output:
[{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
[{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]
In this particular case, the answer is 148 - 125 = 23. How would you suggest doing this in the most Pythonic way?
One solution is to capture the text between the two matches and take its length, as follows:
min([len(x) for x in re.findall(r'Trump(.*?)Ukraine', text)])
This prints 23.
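One caveat with the lazy pattern: each match starts at the first Trump the scan reaches, so a later, closer Trump can end up inside the captured group, and only the Trump-before-Ukraine order is covered. Below is a minimal sketch of the problem and one possible workaround, a negative lookahead that stops the gap from crossing another Trump. The sample string is made up for illustration.

```python
import re

# Hypothetical sample: the second "Trump" is much closer to "Ukraine".
sample = "Trump aaaa Trump b Ukraine"

# Lazy pattern from the answer above: the match starts at the first "Trump",
# so the captured gap includes the second "Trump".
naive = min(len(x) for x in re.findall(r'Trump(.*?)Ukraine', sample))

# Workaround: forbid another "Trump" inside the captured gap.
tempered = min(len(x) for x in re.findall(r'Trump((?:(?!Trump).)*?)Ukraine', sample))

print(naive, tempered)  # 14 3
```

Neither pattern matches across newlines unless re.DOTALL is passed, and min() raises ValueError when there is no match at all.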
Using itertools.product:
min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0)
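Alternatively, on Python 3.8+ you could use the walrus operator. The assignment has to be parenthesized: without the parentheses it is a syntax error in the comprehension condition, and res would be bound to the result of the comparison rather than to the distance:
min(res for x, y in product(res2, res1) if (res := x['start'] - y['end']) > 0)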
Code:
from itertools import product
res1 = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
res2 =[{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]
print(min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0))
# 23
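product compares every pair, which is fine for a handful of mentions. For long documents, a single merge-style pass may be preferable, since finditer already yields matches in order of position. The following is only a sketch under that assumption (each list sorted and non-overlapping); the helper name shortest_gap and the (start, end) tuple layout are illustrative, not part of the original answer.

```python
def shortest_gap(spans_a, spans_b):
    """Smallest gap between any span in spans_a and any span in spans_b.

    Each argument is a list of (start, end) tuples, sorted and
    non-overlapping, as produced by iterating re.finditer.
    Returns None if either list is empty.
    """
    best = None
    i = j = 0
    while i < len(spans_a) and j < len(spans_b):
        a_start, a_end = spans_a[i]
        b_start, b_end = spans_b[j]
        if a_end <= b_start:        # A lies entirely before B
            gap = b_start - a_end
        elif b_end <= a_start:      # B lies entirely before A
            gap = a_start - b_end
        else:                       # the spans overlap
            gap = 0
        if best is None or gap < best:
            best = gap
        # Advance whichever span ends first; it cannot be any closer
        # to a later span from the other list.
        if a_end <= b_end:
            i += 1
        else:
            j += 1
    return best


trump = [(120, 125), (356, 361)]
ukraine = [(148, 155), (425, 432), (568, 575)]
print(shortest_gap(trump, ukraine))  # 23
```

Unlike the product version, this also considers Ukraine appearing before Trump, and it returns None instead of raising when one entity has no mentions.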
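Don't forget to take the absolute value of the distance between the two positions; otherwise the shortest distance can come out negative, which I assume is not what you want: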
# Named mentions rather than dict, to avoid shadowing the built-in type
mentions = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}, {'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]
shortest = float('inf')
start = -1
end = -1
for i in range(len(mentions)):
    for j in range(len(mentions)):
        # Only compare mentions of different entities
        if mentions[i]['name'] != mentions[j]['name']:
            dist = abs(mentions[i]['start'] - mentions[j]['end'])
            if dist < shortest:
                shortest = dist
                start = i
                end = j
print("Start: {}, end: {}, distance: {}\n".format(mentions[start]['name'], mentions[end]['name'], shortest))
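Note that abs(start - end) only equals the gap when the mention supplying start comes after the one supplying end; in the other order it measures from the start of the earlier mention to the end of the later one. A small sketch of a direction-independent alternative, assuming the same dictionaries with start and end keys; the helper names interval_gap and min_gap are made up for illustration.

```python
from itertools import product


def interval_gap(a, b):
    """Characters between two mentions, in either order; 0 if they overlap."""
    return max(b['start'] - a['end'], a['start'] - b['end'], 0)


def min_gap(mentions_a, mentions_b):
    """Smallest gap over all pairs, or None if either list is empty."""
    return min(
        (interval_gap(a, b) for a, b in product(mentions_a, mentions_b)),
        default=None,
    )


res1 = [{'name': 'Trump', 'start': 120, 'end': 125},
        {'name': 'Trump', 'start': 356, 'end': 361}]
res2 = [{'name': 'Ukraine', 'start': 148, 'end': 155},
        {'name': 'Ukraine', 'start': 425, 'end': 432},
        {'name': 'Ukraine', 'start': 568, 'end': 575}]
print(min_gap(res1, res2))  # 23
```

Passing default=None to min avoids a ValueError when one of the entities is not found in the document.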