
How to find the spans of multiple substrings in one string using Python?

I have a text like

xxxx BP 160/110 12/6/2018 sitting left arm @xyz hospital xxxx HgbA1c 12% on 21/1/2019 xxxx

and another string

bp 160/110 hgba1c 12%

How can I get the span of each finding in the text, as below?

[(5, 15), (62, 72)]

Note: The patterns above can vary a lot, so I am looking for a dynamic solution.

Thanks in advance.

This function finds the first window of a that contains every character of b (with multiplicity, in any order) and returns its (start, end) bounds. If no such window exists, it returns False.

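import collections

def find_substring_bounds(a, b):
    """Return (start, end) of the first window of a holding every character of b, else False."""
    need = collections.Counter(b)  # characters of b still to be covered
    missing = len(b)               # how many of them are still outstanding
    for end, char in enumerate(a, 1):
        if need[char] > 0:
            missing -= 1
        need[char] -= 1            # a negative count means a surplus of this character
        if missing == 0: # found all the characters
            start = 0
            # drop surplus characters from the left edge of the window
            while start < end and need[a[start]] < 0:
                need[a[start]] += 1
                start += 1
            return start, end
    return False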
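
To make the "contains every character" condition concrete, here is a minimal sketch of the same test written with Counter subtraction. The helper name covers is hypothetical and not part of the answer. It only checks a window that is already given, and the slices 5:72 are assumed to be the bounds of the window for the example text.

```python
from collections import Counter

def covers(window, b):
    """True if window holds every character of b, with multiplicity, in any order."""
    # Counter subtraction keeps only positive counts, so an empty result
    # means no character of b is still missing from the window.
    return not (Counter(b) - Counter(window))

a = "xxxx BP 160/110 12/6/2018 sitting left arm @xyz hospital xxxx HgbA1c 12% on 21/1/2019 xxxx".lower()
b = "bp 160/110 hgba1c 12%"

print(covers("xx cba xx", "abc"))  # True: character order is ignored
print(covers(a[5:72], b))          # True: this window covers all of b
print(covers(a[5:71], b))          # False: the '%' is cut off
print(covers(a[6:72], b))          # False: one of the two 'b' characters is cut off
```

Because only character counts are compared, a window can qualify even when the characters of b appear in a different order, which is worth keeping in mind for noisy text.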

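We then split that window into the two findings: the part at the left edge that agrees with the start of b, and the part at the right edge that agrees with the end of b. This assumes the second string describes exactly two findings.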

def mid_left_mid_right(a, b, left, right):
    # count the characters on which a[left:] and b agree from the start
    mid_left = 0
    for c1, c2 in zip(a[left:], b):
        if c1 != c2:
            break
        mid_left += 1
    # count the characters on which a[:right] and b agree from the end
    mid_right = 0
    for c1, c2 in zip(a[:right][::-1], b[::-1]):
        if c1 != c2:
            break
        mid_right += 1
    return [(left, left + mid_left), (right - mid_right, right)]
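
A hedged variant of this step is sketched below. It additionally trims whitespace that both strings share where one finding ends and the next begins, so the spans cover only the findings themselves. The names matched_prefix_len and trimmed_spans are hypothetical, and left=5, right=72 are assumed to be the window bounds for the example text.

```python
def matched_prefix_len(x, y):
    """Number of leading characters on which x and y agree."""
    n = 0
    for c1, c2 in zip(x, y):
        if c1 != c2:
            break
        n += 1
    return n

def trimmed_spans(a, b, left, right):
    """Spans of the matching head and tail of a[left:right], without edge whitespace."""
    window = a[left:right]
    # Head: longest prefix of the window that agrees with the start of b.
    head = window[:matched_prefix_len(window, b)].rstrip()
    # Tail: longest suffix of the window that agrees with the end of b.
    tail_len = matched_prefix_len(window[::-1], b[::-1])
    tail = window[len(window) - tail_len:].lstrip()
    return [(left, left + len(head)), (right - len(tail), right)]

a = "xxxx BP 160/110 12/6/2018 sitting left arm @xyz hospital xxxx HgbA1c 12% on 21/1/2019 xxxx".lower()
b = "bp 160/110 hgba1c 12%"
print(trimmed_spans(a, b, 5, 72))  # [(5, 15), (62, 72)]
```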

Example:

s1 = "xxxx BP 160/110 12/6/2018 sitting left arm @xyz hospital xxxx HgbA1c 12% on 21/1/2019 xxxx"
s2 = "bp 160/110 hgba1c 12%"
left_, right_ = find_substring_bounds(s1.lower(), s2)
res = mid_left_mid_right(s1.lower(), s2, left_, right_)
print(res)

Outputs:

[(5, 16), (61, 72)]

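Compared with the spans asked for in the question, (5, 15) and (62, 72), each result is one character wider: the space next to each finding is present in both strings, so it counts as part of the match. Also note that find_substring_bounds returns False when no window is found, so check its result before unpacking it. You may need to amend this for other edge cases in your dataset.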
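
If the findings are always whitespace-separated tokens that appear in the text in the same order as in the second string, a regex-based alternative may be easier to adapt. The sketch below is a different technique from the functions above, not a refinement of them: it greedily matches the longest run of consecutive query tokens, case-insensitively, and moves left to right. The function name find_phrase_spans is hypothetical.

```python
import re

def find_phrase_spans(text, query):
    """Return spans of runs of consecutive query tokens in text, or None if a token is missing."""
    tokens = query.split()
    spans = []
    pos = 0  # only search to the right of the previous match
    i = 0    # index of the first token not yet matched
    while i < len(tokens):
        match = None
        # Try the longest run of remaining tokens first, then shorter runs.
        for j in range(len(tokens), i, -1):
            pattern = r"\s+".join(re.escape(t) for t in tokens[i:j])
            match = re.compile(pattern, re.IGNORECASE).search(text, pos)
            if match:
                break
        if match is None:
            return None  # tokens[i] does not occur after pos
        spans.append(match.span())
        pos = match.end()
        i = j  # continue with the token after the matched run
    return spans

s1 = "xxxx BP 160/110 12/6/2018 sitting left arm @xyz hospital xxxx HgbA1c 12% on 21/1/2019 xxxx"
s2 = "bp 160/110 hgba1c 12%"
print(find_phrase_spans(s1, s2))  # [(5, 15), (62, 72)]
```

This handles any number of findings, but it is greedy and does not check word boundaries, so a short token such as bp could also match inside a longer word.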
