简体   繁体   中英

how to find the shortest distance between two regex

I have a set of documents, where I can search for specific entities, and I need to find the shortest distance between two. Let's say I have a document where I search for Trump and Ukraine , and I get the list of mentions, with their start and end positions:

import re

text = """
 Three constitutional scholars invited by Democrats to testify at Wednesday’s impeachment hearings said that President Trump’s efforts to pressure Ukraine for political gain clearly meet the historical definition of impeachable offenses, according to copies of their opening statements.
 ˜Noah Feldman, a professor at Harvard, argued that attempts by Mr. Trump to withhold a White House meeting and military assistance from Ukraine as leverage for political favors constitute impeachable conduct, as does the act of soliciting foreign assistance on a phone call with Ukraine’s leader.
"""
p1 = re.compile("Trump")
p2 = re.compile("Ukraine")
res1 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p1.finditer(text)]
res2 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p2.finditer(text)]
print(res1)
print(res2)

output:

[{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
[{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

In this specific case the answer is 148 - 125 = 23 . How would you recommend to do it in the most pythonic way?

one solutions is to extract the match and find it's length as below

min([len(x) for x in re.findall(r'Trump(.*?)Ukraine', text)])

Here it prints 23

Use itertools.product :

min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0)

Or with latest Python 3.8+ leveraging the walrus operator, I guess you can also do (untested):

min(res for x, y in product(res2, res1) if res := x['start'] - y['end'] > 0)

Code :

from itertools import product

res1 = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
res2 =[{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

print(min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0))
# 23

Don't forget to take the absolute value of the distance between two points, otherwise the shortest distance will become negative, which is I assume not what you want:

dict = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}, {'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

shortest = 99999999
start = -1
end = -1

for i in range(len(dict)):
    for j in range(len(dict)):
        if(i != j):
            dist = abs(dict[i]['start'] - dict[j]['end'])
            if(dist < shortest):
                shortest = dist
                start = i
                end = j

print("Start: {}, end: {}, distance: {}\n".format(dict[start]['name'], dict[end]['name'], shortest))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM