
How to find the shortest distance between two regex matches

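I have a set of documents in which I search for specific entities, and I need to find the shortest distance between two of them. Say I have a document in which I search for Trump and Ukraine, and I get the list of mentions together with their start and end positions: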

import re

text = """
 Three constitutional scholars invited by Democrats to testify at Wednesday’s impeachment hearings said that President Trump’s efforts to pressure Ukraine for political gain clearly meet the historical definition of impeachable offenses, according to copies of their opening statements.
 ˜Noah Feldman, a professor at Harvard, argued that attempts by Mr. Trump to withhold a White House meeting and military assistance from Ukraine as leverage for political favors constitute impeachable conduct, as does the act of soliciting foreign assistance on a phone call with Ukraine’s leader.
"""
p1 = re.compile("Trump")
p2 = re.compile("Ukraine")
res1 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p1.finditer(text)]
res2 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p2.finditer(text)]
print(res1)
print(res2)

Output:

[{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
[{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

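In this particular case the answer is 148 - 125 = 23: the first Ukraine starts at 148 and the first Trump ends at 125. How would you suggest doing this in the most Pythonic way?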

One solution is to capture the text between the two matches and take its length, as follows:

min([len(x) for x in re.findall(r'Trump(.*?)Ukraine', text)])

This prints 23.
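One caveat with the regex approach, shown here as a sketch: the lazy `(.*?)` starts at the first Trump it sees, so if two Trump mentions come before a Ukraine the captured text is measured from the earlier one, and `.` does not cross newlines unless re.DOTALL is set. The tempered pattern below is my own variation, not part of the answer above, and the sample string is made up for illustration.

```python
import re

# Made-up sample: two "Trump" mentions before a single "Ukraine".
sample = "Trump a Trump b Ukraine"

# Original pattern: the match begins at the first "Trump", so the gap is overstated.
naive = min(len(x) for x in re.findall(r'Trump(.*?)Ukraine', sample))

# Tempered pattern: the captured gap may not contain another "Trump", so each
# match begins at the last "Trump" before "Ukraine".
# re.DOTALL lets "." cross newlines.
tempered_pattern = re.compile(r'Trump((?:(?!Trump).)*?)Ukraine', re.DOTALL)
tempered = min(len(x) for x in tempered_pattern.findall(sample))

print(naive, tempered)
# 11 3
```

Both versions still only look at a Trump followed by a Ukraine, and min() raises ValueError when there is no match at all.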

Using itertools.product:

min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0)

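Or, on Python 3.8+, you can use the walrus operator so the difference is only computed once. The assignment has to be parenthesized, otherwise it is a syntax error inside the comprehension:

min(res for x, y in product(res2, res1) if (res := x['start'] - y['end']) > 0)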

Code:

from itertools import product

res1 = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
res2 =[{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

print(min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0))
# 23
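product compares every pair, which is fine for a handful of mentions. If the lists get long, a single pass over the matches sorted by start position should be enough. A minimal sketch, assuming the same dict layout as res1 and res2 above; the name shortest_gap is made up here, and unlike the product one-liner it accepts either order and treats overlapping matches as distance 0.

```python
def shortest_gap(res1, res2):
    """Smallest number of characters between a match in res1 and one in res2.

    Returns None if either list is empty. The order of the two matches
    does not matter; overlapping matches count as a gap of 0.
    """
    # Tag every match with the list it came from (0 or 1), then sort by start.
    tagged = sorted(
        [(m['start'], m['end'], 0) for m in res1]
        + [(m['start'], m['end'], 1) for m in res2]
    )
    last_end = [None, None]  # furthest end seen so far for each kind
    best = None
    for start, end, kind in tagged:
        other_end = last_end[1 - kind]
        if other_end is not None:
            # Closest earlier match of the other kind is the one ending last.
            gap = max(start - other_end, 0)
            if best is None or gap < best:
                best = gap
        if last_end[kind] is None or end > last_end[kind]:
            last_end[kind] = end
    return best


res1 = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
res2 = [{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]
print(shortest_gap(res1, res2))
# 23
```

The sort makes this O((n + m) log(n + m)) instead of O(n * m) for the pairwise version.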

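Don't forget to take the absolute value of the distance between the two points, otherwise the shortest distance comes out negative, which I assume is not what you want: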

# named "matches" rather than "dict" so the built-in dict type is not shadowed
matches = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}, {'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

shortest = float('inf')
start = -1
end = -1

for i in range(len(matches)):
    for j in range(len(matches)):
        # only compare a Trump match with a Ukraine match, never two of the same kind
        if matches[i]['name'] != matches[j]['name']:
            dist = abs(matches[i]['start'] - matches[j]['end'])
            if dist < shortest:
                shortest = dist
                start = i
                end = j

print("Start: {}, end: {}, distance: {}".format(matches[start]['name'], matches[end]['name'], shortest))
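If you also want to know which two mentions are closest, another option is to keep the match objects and let min pick the pair. This is only a sketch: closest_pair and gap are names I made up, and the distance is taken as the number of characters between the two spans (0 if they overlap), in either order.

```python
import re
from itertools import product


def gap(pair):
    """Characters between two match spans, regardless of which comes first."""
    x, y = pair
    # Exactly one of the two differences is positive when the spans are disjoint.
    return max(x.start() - y.end(), y.start() - x.end(), 0)


def closest_pair(pattern_a, pattern_b, text):
    """Return (match_a, match_b, distance) for the closest pair, or None."""
    matches_a = list(re.finditer(pattern_a, text))
    matches_b = list(re.finditer(pattern_b, text))
    if not matches_a or not matches_b:
        return None
    best = min(product(matches_a, matches_b), key=gap)
    return best[0], best[1], gap(best)


# Made-up sample where the second pattern comes first in the text.
result = closest_pair("Trump", "Ukraine", "Ukraine and Trump, then Ukraine again")
print(result[0].span(), result[1].span(), result[2])
```

Since the match objects are returned, text[m.start():m.end()] or m.group() gives back the matched strings if you need them.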
