From a list of strings, create a new list where each item indicates whether the corresponding item in the original list lies between two specific entries
Say I have this list:
jay = ['Despite', 'similar', 'intensity', 'of', 'alcohol', '<Disease:D013375>', 'withdrawal', 'symptoms', '</Disease:D013375>', ',', 'ALC', '/', 'COC', 'subjects', 'received', 'less', 'oxazepam', 'to', 'treat', 'alcohol', '<Disease:D013375>', 'withdrawal', 'symptoms', '</Disease:D013375>', 'compared', 'to', 'ALC', 'subjects', '.']
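I want to create a new list that corresponds to the original one. If items lie between '<Disease:XXXXX>' and '</Disease:XXXXX>', the first of them is labeled 'B-COL' and the remaining ones are labeled 'I-COL'. The items '<Disease:XXXXX>' and '</Disease:XXXXX>' themselves get no label. XXXXX stands for an identifier that can vary (for example, D013375).
All other items are labeled 'O'.
So this is the sample output I want: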
idealOutput= ['O', 'O', 'O', 'O', 'O', 'B-COL', 'I-COL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-COL', 'I-COL', 'O', 'O', 'O', 'O', 'O']
The number of "Disease" tag pairs can vary, and so can the number of items between the tags.
Here is my attempt:
wow = jay
labs = []
for i in range(0, len(wow)):
    if wow[i].startswith("<Disease"):
        labs.append('DelStrB')
    elif i>0 and i<=len(labs):
        if labs[i-1] == 'DelStrB':
            labs.append('B-COL')
            i = i + 1
            while not (wow[i].startswith("</Disease")):
                labs.append('I-COL')
                i = i + 1
            if wow[i].startswith("</Disease"):
                labs.append('DelStrE')
                i = i + 1
        elif wow[i].startswith("</Disease"):
            k=9 #do nothing
        else:
            labs.append('O')
    elif wow[i].startswith("</Disease"):
        k=9 #do nothing
    else:
        labs.append('O')
labs[:] = [x for x in labs if x != 'DelStrB']
labs[:] = [x for x in labs if x != 'DelStrE']
print(labs)
The result is:
['O', 'O', 'O', 'O', 'O', 'B-COL', 'I-COL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-COL', 'O', 'O', 'O', 'O', 'O']
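This is incorrect. I also know there is a more computationally efficient and more elegant way to code this, but I have not been able to come up with it.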
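One likely contributor to the wrong result: reassigning `i` inside a `for i in range(...)` loop does not skip iterations, because `range` supplies the next value regardless of what the body does to `i`. A minimal sketch, using a hypothetical helper name (`visited_indices` is not from the original post):

```python
def visited_indices(n):
    """Return every index the for-loop visits, even though the body tries to skip ahead."""
    seen = []
    for i in range(n):
        seen.append(i)
        i = i + 2  # no effect on the next iteration; range() hands out the next value
    return seen

print(visited_indices(5))  # [0, 1, 2, 3, 4]
```

In the attempt above, this means the tokens already consumed by the inner `while` are visited again by the outer loop. A state flag, as used in the answers, sidesteps manual index bookkeeping.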
You can use a simple generator:
import re
jay = ['Despite', 'similar', 'intensity', 'of', 'alcohol', '<Disease:D013375>', 'withdrawal', 'symptoms', '</Disease:D013375>', ',', 'ALC', '/', 'COC', 'subjects', 'received', 'less', 'oxazepam', 'to', 'treat', 'alcohol', '<Disease:D013375>', 'withdrawal', 'symptoms', '</Disease:D013375>', 'compared', 'to', 'ALC', 'subjects', '.']
def results(d):
    # _flag: -1 = outside a tag pair, 1 = just entered one, 0 = 'B-COL' already emitted
    _flag = -1
    for i in d:
        if re.match(r'<Disease:\w+>', i):
            _flag = 1
        elif re.match(r'</Disease:\w+>', i):
            _flag = -1
        else:
            if _flag == -1:
                yield 'O'
            elif _flag == 1:
                yield 'B-COL'
                _flag = 0
            else:
                yield 'I-COL'
print(list(results(jay)))
Output:
['O', 'O', 'O', 'O', 'O', 'B-COL', 'I-COL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-COL', 'I-COL', 'O', 'O', 'O', 'O', 'O']
A solution using an iterative approach:
jay = ['Despite', 'similar', 'intensity', 'of', 'alcohol', '<Disease:D013375>', 'withdrawal', 'symptoms', '</Disease:D013375>', ',', 'ALC', '/', 'COC', 'subjects', 'received', 'less', 'oxazepam', 'to', 'treat', 'alcohol', '<Disease:D013375>', 'withdrawal', 'symptoms', '</Disease:D013375>', 'compared', 'to', 'ALC', 'subjects', '.']
result = []
inside = False      # True while between an opening and a closing Disease tag
seen_BCOL = False   # True once the first item of the current pair has been labeled
for token in jay:
    if token.startswith('<Disease'):
        inside = True
    elif token.startswith('</Disease'):
        inside = False
        seen_BCOL = False
    elif inside:
        if not seen_BCOL:
            result.append('B-COL')
            seen_BCOL = True
        else:
            result.append('I-COL')
    else:
        result.append('O')
print(result)
Output:
['O', 'O', 'O', 'O', 'O', 'B-COL', 'I-COL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-COL', 'I-COL', 'O', 'O', 'O', 'O', 'O']
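You can use itertools.groupby with a key function that looks for the "Disease" items, which splits the list into even-numbered and odd-numbered groups that are labeled differently: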
import re
from itertools import groupby
[t
 for i, l in enumerate(list(g) for k, g in groupby(jay, key=lambda s: re.match(r'</?Disease:\w+>', s)) if not k)
 for t in (('B-COL',) + ('I-COL',) * (len(l) - 1) if i % 2 else ('O',) * len(l))]
This returns:
['O', 'O', 'O', 'O', 'O', 'B-COL', 'I-COL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-COL', 'I-COL', 'O', 'O', 'O', 'O', 'O']
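For readability, the same even/odd grouping idea can be unrolled into a function. This is only a sketch: the names `TAG` and `label_with_groupby` are made up here, and it assumes that the list starts outside a tag pair, that tags are balanced, and that at least one ordinary item separates a closing tag from the next opening tag.

```python
import re
from itertools import groupby

TAG = re.compile(r'</?Disease:\w+>')  # matches both opening and closing tags

def label_with_groupby(tokens):
    """Label runs of non-tag tokens: even-numbered runs are outside, odd-numbered runs inside."""
    # Group consecutive tokens by "is this a tag?" and keep only the non-tag runs.
    runs = [list(group)
            for is_tag, group in groupby(tokens, key=lambda s: bool(TAG.match(s)))
            if not is_tag]
    labels = []
    for index, run in enumerate(runs):
        if index % 2:  # odd run: sits between an opening and a closing tag
            labels.append('B-COL')
            labels.extend(['I-COL'] * (len(run) - 1))
        else:          # even run: outside any tag pair
            labels.extend(['O'] * len(run))
    return labels

sample = ['alcohol', '<Disease:D013375>', 'withdrawal', 'symptoms', '</Disease:D013375>', ',', 'ALC']
print(label_with_groupby(sample))  # ['O', 'B-COL', 'I-COL', 'O', 'O']
```

If the input may begin with a tag or contain back-to-back tag pairs, the parity trick no longer holds and a flag-based loop is the safer choice.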
Note that your expected output is incorrect: it has two extra 'O's between the two sequences of 'B-COL' and 'I-COL'.
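The size of that gap can be checked directly against the `jay` list from the question. The sketch below counts the items between the first closing tag and the second opening tag, which is the number of 'O' labels that belong between the two labeled sequences; the variable names are illustrative only.

```python
jay = ['Despite', 'similar', 'intensity', 'of', 'alcohol', '<Disease:D013375>',
       'withdrawal', 'symptoms', '</Disease:D013375>', ',', 'ALC', '/', 'COC',
       'subjects', 'received', 'less', 'oxazepam', 'to', 'treat', 'alcohol',
       '<Disease:D013375>', 'withdrawal', 'symptoms', '</Disease:D013375>',
       'compared', 'to', 'ALC', 'subjects', '.']

close_first = jay.index('</Disease:D013375>')              # first closing tag
open_second = jay.index('<Disease:D013375>', close_first)  # next opening tag after it
gap = open_second - close_first - 1                        # untagged items in between

print(gap)  # 11
```

Eleven items sit between the two tagged spans, which matches the number of 'O's in the outputs of the answers.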