I have a text like this:
xxxx BP 160/110
12/6/2018 sitting left arm @xyz hospital xxxx HgbA1c 12%
on 21/1/2019 xxxx
and another string:
bp 160/110 hgba1c 12%
Now, how can I get the span of each finding from the second string within the first text, as below?
[(5, 15), (62, 72)]
Note: The patterns mentioned above can vary a lot, so I am looking for a dynamic solution.
Thanks in advance.
This function finds the bounds of the earliest window in a that contains every character of b, counting repeats (if there is no such window, it returns False):
import collections

def find_substring_bounds(a, b):
    need = collections.Counter(b)  # how many of each character are still required
    missing = len(b)               # total number of characters still required
    for end, char in enumerate(a, 1):
        if need[char] > 0:
            missing -= 1
        need[char] -= 1            # negative counts mark surplus characters
        if missing == 0:  # found all the characters
            start = 0
            # skip surplus characters at the left edge of the window
            while start < end and need[a[start]] < 0:
                need[a[start]] += 1
                start += 1
            return start, end
    return False
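To see why the need[char] > 0 test also works for characters that never occur in b, here is a small sketch of the Counter bookkeeping on its own. The strings "aab" and "xaba" are made-up inputs for illustration only, not taken from the question.

```python
import collections

# Hypothetical toy inputs, chosen only to show the bookkeeping.
need = collections.Counter("aab")   # a: 2, b: 1
missing = len("aab")

for char in "xaba":
    # A Counter returns 0 for unknown keys, so 'x' never counts as needed.
    if need[char] > 0:
        missing -= 1
    # Every character is decremented, so surplus ones end up negative.
    need[char] -= 1

print(missing)      # 0
print(need["x"])    # -1
```

A negative count is what lets the function later skip surplus characters at the left edge of the window.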
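We then need to find the middle-left and middle-right bounds, i.e. where the matching prefix of b ends and where the matching suffix of b starts inside that window: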
def mid_left_mid_right(a, b, left, right):
    # count how many characters after `left` match the start of b
    mid_left = 0
    for c1, c2 in zip(a[left:], b):
        if c1 != c2:
            break
        mid_left += 1
    # count how many characters before `right` match the end of b
    mid_right = 0
    for c1, c2 in zip(a[:right][::-1], b[::-1]):
        if c1 != c2:
            break
        mid_right += 1
    return [(left, left + mid_left), (right - mid_right, right)]
Example:
s1 = "xxxx BP 160/110 12/6/2018 sitting left arm @xyz hospital xxxx HgbA1c 12% on 21/1/2019 xxxx"
s2 = "bp 160/110 hgba1c 12%"
left_, right_ = find_substring_bounds(s1.lower(), s2)
res = mid_left_mid_right(s1.lower(), s2, left_, right_)
print(res)
Outputs:
[(5, 16), (61, 72)]
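These spans are one character wider than the ones asked for in the question, because the space that separates the two findings in s2 also matches a space in s1. A minimal sketch of one way to trim that whitespace afterwards is below; the helper name tighten is hypothetical and not part of the original answer.

```python
def tighten(text, spans):
    """Shrink each (start, end) span so it excludes surrounding whitespace."""
    result = []
    for start, end in spans:
        chunk = text[start:end]
        lead = len(chunk) - len(chunk.lstrip())   # whitespace at the front
        core = len(chunk.strip())                 # length without whitespace
        result.append((start + lead, start + lead + core))
    return result

s1 = "xxxx BP 160/110 12/6/2018 sitting left arm @xyz hospital xxxx HgbA1c 12% on 21/1/2019 xxxx"
print(tighten(s1, [(5, 16), (61, 72)]))  # [(5, 15), (62, 72)]
```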
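You may need to amend these functions for any edge cases in your dataset. For example, find_substring_bounds returns False when no window is found, so the tuple unpacking in the example above would raise a TypeError in that case.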
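If the findings in the second string always appear in the same order in the text, a different and simpler technique may be enough: search for each whitespace-separated token in turn and merge tokens that are separated only by whitespace in the text. This is a sketch under that assumption, not the method used above. The function name finding_spans is hypothetical, and it assumes lowercasing does not change the length of the text (true for ASCII input like the example).

```python
def finding_spans(text, query):
    """Return merged (start, end) spans of query tokens found in order in text,
    or None if some token cannot be found."""
    haystack = text.lower()
    spans = []
    pos = 0
    for token in query.lower().split():
        start = haystack.find(token, pos)
        if start == -1:
            return None  # token missing from the rest of the text
        end = start + len(token)
        # Merge with the previous span if only whitespace lies in between.
        if spans and not haystack[spans[-1][1]:start].strip():
            spans[-1] = (spans[-1][0], end)
        else:
            spans.append((start, end))
        pos = end
    return spans

s1 = "xxxx BP 160/110 12/6/2018 sitting left arm @xyz hospital xxxx HgbA1c 12% on 21/1/2019 xxxx"
s2 = "bp 160/110 hgba1c 12%"
print(finding_spans(s1, s2))  # [(5, 15), (62, 72)]
```

Because tokens are matched left to right, a finding that occurs in a different order in the text than in the query would not be found by this sketch.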