
How to get the sub-string of the original sentence from a sub-string in the converted sentence

I'm working on a phone number detector that takes in a sentence containing phone numbers written as digits or as words. Before detecting a phone number, I convert all numbers in word form into digits. Once a phone number has been detected, I have

  • the original text (e.g. 'My number is seven double nine seven eight two six two seven three for now.'),
  • the converted text (e.g. 'My number is 7997826273 for now.'), and
  • the offsets of the detected phone number in the converted string (e.g. [13,22], i.e. '7997826273').

Now I want to get the substring of the original text that corresponds to the detected number, i.e. 'seven double nine seven eight two six two seven three'.

I tried building a regex pattern from the parts of the converted text that are not the detected number, then applying that pattern to the original text to find the original substring I need.

But this fails if the user's message contains some other number word, because that word is converted too and the surrounding text no longer matches the original. Ex: 'I'm the one and My number is seven double nine seven eight two six two seven three for now.'

Below is my code for the above regex attempt. It needs to be modified to accommodate multiple phone numbers in a sentence. But, I stopped because of the above-mentioned problem.

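import re

a = 'My number is seven double nine seven eight two six two seven three for now.'  # original text
b = 'My number is 7997826273 for now.'  # converted text
c = '7997826273'  # detected phone number
d = [13, 22]      # inclusive offsets of c in b (not used by this regex approach)

def create_pattern(detected_array):
    # Capture whatever surrounds each known piece of text. re.escape keeps
    # punctuation in the text from being interpreted as regex syntax.
    pattern = ''
    for text_match in detected_array:
        pattern += '(.*)' + '(?:' + re.escape(text_match) + ')'
    pattern += '(.*$)'
    return pattern

# Step 1: split the converted text around the detected number.
pattern_1 = create_pattern([c])
patterns = re.findall(pattern_1, b)
print(patterns)  # [('My number is ', ' for now.')]

# Step 2: use the surrounding pieces to locate the number words in the original.
pattern_2 = create_pattern(list(patterns[0]))
patterns = re.findall(pattern_2, a)
print(patterns[0][1])  # seven double nine seven eight two six two seven three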

Any help is appreciated. Python implementation is preferred. Thanks.

While converting the number words to digits, also record the index span of each original word. Then, when a phone number is detected in the converted text, you can map its offsets back to the corresponding span in the original text.
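A rough sketch of that idea follows. It is not the asker's actual converter, which is not shown: the tables `DIGITS` and `REPEATS` and the helpers `convert_with_spans` and `original_span` are hypothetical names made up for illustration, and the sketch assumes only single digit words plus "double"/"triple" need handling. Each character of the converted string remembers the span of the original text it came from, so inclusive offsets such as [13,22] can be mapped straight back.

```python
import re

# Hypothetical lookup tables; a real converter may cover more words.
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
REPEATS = {"double": 2, "triple": 3}

_digit = "|".join(DIGITS)
_repeat = "|".join(REPEATS)
# One spoken digit, optionally preceded by "double" or "triple".
UNIT = re.compile(rf"\b(?:({_repeat})\s+)?({_digit})\b", re.IGNORECASE)


def convert_with_spans(text):
    """Return (converted, spans); spans[i] is the (start, end) slice of
    `text` that produced character i of `converted`."""
    out, spans = [], []
    pos = 0             # first original index not yet consumed
    after_unit = False  # True once the previous piece was a number word
    for m in UNIT.finditer(text):
        gap = text[pos:m.start()]
        # Whitespace between adjacent number words is dropped so digits join up;
        # any other text is copied through with an identity mapping.
        if not (after_unit and gap.isspace()):
            for i, ch in enumerate(gap, start=pos):
                out.append(ch)
                spans.append((i, i + 1))
        count = REPEATS[m.group(1).lower()] if m.group(1) else 1
        digit = DIGITS[m.group(2).lower()]
        for _ in range(count):
            out.append(digit)
            spans.append(m.span())  # e.g. both 9s map to "double nine"
        pos = m.end()
        after_unit = True
    # Copy whatever follows the last number word.
    for i, ch in enumerate(text[pos:], start=pos):
        out.append(ch)
        spans.append((i, i + 1))
    return "".join(out), spans


def original_span(spans, start, end):
    """Map inclusive converted offsets [start, end] to an original slice."""
    covered = spans[start:end + 1]
    return min(s for s, _ in covered), max(e for _, e in covered)


a = "I'm the one and My number is seven double nine seven eight two six two seven three for now."
converted, spans = convert_with_spans(a)
m = re.search(r"\d{10}", converted)            # stand-in for the real detector
s, e = original_span(spans, m.start(), m.end() - 1)
print(a[s:e])  # seven double nine seven eight two six two seven three
```

Because the mapping is built during conversion, the stray "one" earlier in the sentence does not matter, and several detected numbers in one sentence can each be looked up with their own offsets.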
