How to convert annotated text in XML to CoNLL?
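I need to preprocess XML files for an NER task, and I am struggling to convert them. I assume there is a nice, simple way to solve the following problem.
Given annotated text in XML with the following input structure: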
<doc>
Some <tag1>annotated text</tag1> in <tag2>XML</tag2>.
</doc>
I would like a CoNLL file in the IOB2 tagging format, like the following output:
Some O
annotated B-TAG1
text I-TAG1
in O
XML B-TAG2
. O
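For reference, the IOB2 scheme in the output above marks the first token of every entity with B-, each following token of the same entity with I-, and everything else with O. A minimal sketch of that rule follows; the helper name iob2 is hypothetical and only illustrates the scheme for one span of tokens.

```python
def iob2(tokens, label=None):
    """Return one IOB2 tag per token; label=None means the span is outside any entity."""
    if label is None:
        return ["O"] * len(tokens)
    # First token of the entity gets B-, every later token gets I-
    return [("B-" if i == 0 else "I-") + label for i in range(len(tokens))]


# Tag the entity span "annotated text" with the label TAG1
for token, tag in zip("annotated text".split(), iob2("annotated text".split(), "TAG1")):
    print(token, tag)
```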
Let's save your XML file as a TXT file (call it "read.txt") that looks like this:
<doc>
Some <tag1>annotated text</tag1> in <tag2>Tag2 entity</tag2> <tag1>tag1 entity</tag1>.
Some <tag3>annotated text</tag3> in <tag2>XML</tag2>!
</doc>
Then, using regular expressions and a few if-else conditions, the code below writes an "output.txt" file in CoNLL format, as required.
import re

sentences, conll = [], []

# Read the input, skipping blank lines and the <doc> wrapper lines
with open('read.txt', 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if line and line not in ['<doc>', '</doc>']:
            sentences.append(line)

for sentence in sentences:
    # Collect the text of every entity, per tag
    tag1 = re.findall(r'<tag1>(.+?)</tag1>', sentence)
    tag2 = re.findall(r'<tag2>(.+?)</tag2>', sentence)
    tag3 = re.findall(r'<tag3>(.+?)</tag3>', sentence)
    # Split the sentence at the tags into entity and non-entity chunks
    splitted = re.split('<tag1>|</tag1>|<tag2>|</tag2>|<tag3>|</tag3>', sentence)
    if tag1 or tag2 or tag3:  # the sentence contains at least one entity
        # Chunks are matched by their text, so a non-entity chunk whose text
        # equals an entity's text would be tagged as that entity.
        for split in splitted:
            if split in tag1:
                counter = 0
                for token in split.split():
                    if counter > 0:
                        conll.append(token + ' I-TAG1')
                    else:
                        conll.append(token + ' B-TAG1')
                    counter += 1
            elif split in tag2:
                counter = 0
                for token in split.split():
                    if counter > 0:
                        conll.append(token + ' I-TAG2')
                    else:
                        conll.append(token + ' B-TAG2')
                    counter += 1
            elif split in tag3:
                counter = 0
                for token in split.split():
                    if counter > 0:
                        conll.append(token + ' I-TAG3')
                    else:
                        conll.append(token + ' B-TAG3')
                    counter += 1
            else:  # the chunk is not an entity
                for token in split.split():
                    conll.append(token + ' O')
    else:  # no entity in the sentence
        for word in sentence.split():
            conll.append(word + ' O')
    conll.append('')  # a blank line separates sentences

# Write one "token tag" pair per line
with open('output.txt', 'w', encoding='utf-8') as output:
    for element in conll:
        output.write(element + "\n")
output.txt:
Some O
annotated B-TAG1
text I-TAG1
in O
Tag2 B-TAG2
entity I-TAG2
tag1 B-TAG1
entity I-TAG1
. O

Some O
annotated B-TAG3
text I-TAG3
in O
XML B-TAG2
! O
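If the tag names are not known in advance, one alternative to listing each tag in the regex is to let Python's standard xml.etree.ElementTree module parse the file. The following is only a minimal sketch under some assumptions: the input is well-formed XML with a single root element, entities are direct children of the root, and tokens are separated by whitespace. The function names tag_tokens and xml_to_conll are illustrative and not part of the answer above. Unlike the code above, this sketch treats the whole document as one sequence and does not insert blank lines between sentences.

```python
import xml.etree.ElementTree as ET


def tag_tokens(text, label=None):
    """Yield 'token TAG' lines for the whitespace-separated tokens in text."""
    for i, token in enumerate((text or "").split()):
        if label is None:
            yield f"{token} O"
        else:
            prefix = "B" if i == 0 else "I"
            yield f"{token} {prefix}-{label.upper()}"


def xml_to_conll(xml_string):
    """Convert one annotated XML document to a list of IOB2-tagged lines."""
    root = ET.fromstring(xml_string)
    lines = list(tag_tokens(root.text))      # text before the first entity
    for child in root:
        # Text inside the element is the entity; the element name is the label
        lines.extend(tag_tokens("".join(child.itertext()), child.tag))
        lines.extend(tag_tokens(child.tail))  # plain text after the entity
    return lines


sample = "<doc>\nSome <tag1>annotated text</tag1> in <tag2>XML</tag2>.\n</doc>"
print("\n".join(xml_to_conll(sample)))
```

Because the label is taken from the element name, adding a new tag to the data needs no code change. Punctuation that touches a word inside plain text (for example "text.") stays a single token, so a real tokenizer may be needed for such input.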