Replace all occurrences of a string (given their starting and ending indices) with a unique ID in Python

Question

I am working on a dataset that I have to preprocess. I want to replace every annotated occurrence of an entity (each given by its starting and ending index) with the entity's unique ID.

Given a string of text like:

s = "The hypotensive effect of 100 mg/kg alpha-methyldopa was also partially reversed by naloxone. Naloxone alone did not affect either blood pressure or heart rate. In brain membranes from spontaneously hypertensive rats clonidine, 10(-8) to 10(-5) M, did not influence stereoselective binding of [3H]-naloxone (8 nM), and naloxone, 10(-8) to 10(-4) M, did not influence naloxone-suppressible binding of [3H]-dihydroergocryptine (1 nM)."

and a dictionary that maps each ID to a list of annotation dictionaries, like:

{
'D006973': [{'length': '12', 'offset': '199', 'text': ['hypertensive'], 'type': 'Disease'}],
'D008750': [{'length': '16', 'offset': '36', 'text': ['alpha-methyldopa'], 'type': 'Chemical'}],
'D007022': [{'length': '11', 'offset': '4', 'text': ['hypotensive'], 'type': 'Disease'}],
'D009270': [{'length': '8', 'offset': '84', 'text': ['naloxone'], 'type': 'Chemical'}, {'length': '8', 'offset': '94', 'text': ['Naloxone'], 'type': 'Chemical'}, {'length': '13', 'offset': '293', 'text': ["[3H]-naloxone"], 'type': 'Chemical'}]
}

I want to replace all occurrences given by the offsets with their respective IDs. So for the last key, I want every entry in its list to be replaced by 'D009270'.

Example 1: for the first entry, with key 'D006973', I want to replace 'hypertensive', which starts at index 199 and has length 12, with 'D006973'.

Example 2: for the last entry, with key 'D009270', I want to replace the substrings at these (start, end) index pairs:

[(84, 92), (94, 102), (293, 306)]
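For reference, these pairs follow directly from the 'offset' and 'length' fields. A minimal sketch, assuming the end index is exclusive; `spans_for` is a hypothetical helper name, not something from the dataset:

```python
def spans_for(entries):
    """Return (start, end) pairs, end exclusive, from string 'offset'/'length' fields."""
    return [
        (int(e['offset']), int(e['offset']) + int(e['length']))
        for e in entries
    ]

# The three annotations listed under 'D009270'.
entries = [
    {'length': '8', 'offset': '84', 'text': ['naloxone'], 'type': 'Chemical'},
    {'length': '8', 'offset': '94', 'text': ['Naloxone'], 'type': 'Chemical'},
    {'length': '13', 'offset': '293', 'text': ['[3H]-naloxone'], 'type': 'Chemical'},
]

print(spans_for(entries))  # [(84, 92), (94, 102), (293, 306)]
```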

In the last sentence, naloxone also appears inside "naloxone-suppressible", but I don't want to replace that occurrence, so I cannot simply use str.replace().
I tried replacing the text from the start index to the end index (e.g. 199 to 211 for 'hypertensive') with its unique ID, but this shifts the offsets of the other entities that have not been replaced yet. I could pad the ID when the replacement ('D006973') is shorter than the original string ('hypertensive'), but that fails when the replacement is longer than the original.
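The shift can be reproduced on the first part of the text. This is only an illustration of the problem, using the first two annotations by position; the variable names are made up for the example:

```python
prefix = "The hypotensive effect of 100 mg/kg alpha-methyldopa"

# Before any replacement, the stored offsets point at the right text.
assert prefix[4:15] == "hypotensive"
assert prefix[36:52] == "alpha-methyldopa"

# Replace the first span: 11 characters become 7, so everything after it moves 4 to the left.
shifted = prefix[:4] + "D007022" + prefix[15:]

# The stored offsets for the second span no longer match.
print(repr(shifted[36:52]))  # 'a-methyldopa'
```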

Answer 1

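You can use string formatting with a placeholder character. Overwrite each span in place with "{}" followed by padding characters, so the string keeps its length and every remaining offset stays valid; then strip the padding and fill in the IDs with str.format(). This assumes the text itself contains no "@", "{" or "}" and that every span is at least two characters long: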

from operator import itemgetter

s = "The hypotensive effect of 100 mg/kg alpha-methyldopa was also partially reversed by naloxone. Naloxone alone did not affect either blood pressure or heart rate. In brain membranes from spontaneously hypertensive rats clonidine, 10(-8) to 10(-5) M, did not influence stereoselective binding of [3H]-naloxone (8 nM), and naloxone, 10(-8) to 10(-4) M, did not influence naloxone-suppressible binding of [3H]-dihydroergocryptine (1 nM)."

dictionary={
'D006973': [{'length': '12', 'offset': '199', 'text': ['hypertensive'], 'type': 'Disease'}],
'D008750': [{'length': '16', 'offset': '36', 'text': ['alpha-methyldopa'], 'type': 'Chemical'}],
'D007022': [{'length': '11', 'offset': '4', 'text': ['hypotensive'], 'type': 'Disease'}],
'D009270': [{'length': '8', 'offset': '84', 'text': ['naloxone'], 'type': 'Chemical'}, {'length': '8', 'offset': '94', 'text': ['Naloxone'], 'type': 'Chemical'}, {'length': '13', 'offset': '293', 'text': ["[3H]-naloxone"], 'type': 'Chemical'}]
}

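# Collect a (start, end, ID) tuple for every annotated span.
index_list = []
for key in dictionary:
    for dic in dictionary[key]:
        o = int(dic['offset'])
        index_tuple = o, o + int(dic['length']), key
        index_list.append(index_tuple)

# Sort by start offset so the IDs line up with the "{}" slots from left to right.
index_list.sort(key=itemgetter(0))
format_list = []
lt = list(s)
for si, ei, key in index_list:
    # Overwrite the span with "{}" plus "@" padding; the length is unchanged,
    # so the offsets of later spans are still correct.
    lt[si:ei] = list("{}") + ["@"] * ((ei - si) - 2)
    format_list.append(key)

text = "".join(lt)
text = text.replace("@", "")      # drop the padding
text = text.format(*format_list)  # fill each "{}" with its ID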

Result: 'The D007022 effect of 100 mg/kg D008750 was also partially reversed by D009270. D009270 alone did not affect either blood pressure or heart rate. In brain membranes from spontaneously D006973 rats clonidine, 10(-8) to 10(-5) M, did not influence stereoselective binding of D009270 (8 nM), and naloxone, 10(-8) to 10(-4) M, did not influence naloxone-suppressible binding of [3H]-dihydroergocryptine (1 nM).'
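As an alternative to the placeholder trick, the spans can be replaced starting from the end of the text, since editing the tail never moves the offsets that come before it. This is a different technique from the answer above, sketched under the assumption that the annotated spans do not overlap; `replace_spans` is a hypothetical helper name:

```python
def replace_spans(text, annotations):
    """Replace each annotated span with its ID, working from the end of the text."""
    spans = [
        (int(d['offset']), int(d['offset']) + int(d['length']), key)
        for key, dicts in annotations.items()
        for d in dicts
    ]
    # Highest start offset first, so earlier offsets are untouched by each edit.
    for start, end, key in sorted(spans, reverse=True):
        text = text[:start] + key + text[end:]
    return text

# Shortened sample: the opening of the text with its two annotations.
sample = "The hypotensive effect of 100 mg/kg alpha-methyldopa"
annotations = {
    'D008750': [{'length': '16', 'offset': '36'}],
    'D007022': [{'length': '11', 'offset': '4'}],
}

print(replace_spans(sample, annotations))
# The D007022 effect of 100 mg/kg D008750
```

This version does not depend on a padding character, so it also works when the text contains "@" or braces, and when a span is shorter than two characters.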
