How do I convert annotated text in XML to CoNLL?

Question

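I need to preprocess XML files for an NER task, and I am struggling with the conversion. I suspect there is a nice, simple way to solve the following problem.

Given annotated text in XML with the following input structure: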

<doc>
   Some <tag1>annotated text</tag1> in <tag2>XML</tag2>.
</doc>

I would like a CoNLL file in the IOB2 tagging format, like the output below:

Some          O
annotated     B-TAG1
text          I-TAG1
in            O
XML           B-TAG2
.             O

Answer 1

Suppose your XML file has been saved as a TXT file (called "read.txt") with the following content:

<doc>
   Some <tag1>annotated text</tag1> in <tag2>Tag2 entity</tag2> <tag1>tag1 entity</tag1>.
   Some <tag3>annotated text</tag3> in <tag2>XML</tag2>!
</doc>

Then, using regular expressions and a few if-else conditions, the code below writes an "output.txt" file in CoNLL format, as required.

import re

sentences, connl = [], []

with open('read.txt', 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if line not in ['<doc>', '</doc>']:
            sentences.append(line)

for sentence in sentences:
    tag1 = re.findall(r'<tag1>(.+?)</tag1>', sentence)
    tag2 = re.findall(r'<tag2>(.+?)</tag2>', sentence)
    tag3 = re.findall(r'<tag3>(.+?)</tag3>', sentence)
    splitted = re.split('<tag1>|</tag1>|<tag2>|</tag2>|<tag3>|</tag3>', sentence)  # split the sentence at every tag boundary
    if tag1 or tag2 or tag3:  # if the sentence contains any tag
        for split in splitted:  # check each segment against the tagged spans
            if split in tag1:
                counter = 0
                for token in split.split():
                    if counter > 0:
                        connl.append(token + ' I-TAG1')
                    else:
                        connl.append(token + ' B-TAG1')
                    counter += 1

            elif split in tag2:
                counter = 0
                for token in split.split():
                    if counter > 0:
                        connl.append(token + ' I-TAG2')
                    else:
                        connl.append(token + ' B-TAG2')
                    counter += 1

            elif split in tag3:
                counter = 0
                for token in split.split():
                    if counter > 0:
                        connl.append(token + ' I-TAG3')
                    else:
                        connl.append(token + ' B-TAG3')
                    counter += 1

            else:  # current word is not an entity
                for token in split.split():
                    connl.append(token + ' O')

    else:  # if no entity in sentence
        for word in sentence.split():
            connl.append(word + ' O')

    connl.append('')

with open('output.txt', 'w', encoding='utf-8') as output:
    for element in connl:
        output.write(element + "\n")

output.txt:

Some O
annotated B-TAG1
text I-TAG1
in O
Tag2 B-TAG2
entity I-TAG2
tag1 B-TAG1
entity I-TAG1
. O

Some O
annotated B-TAG3
text I-TAG3
in O
XML B-TAG2
! O
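
If the tag names are not known in advance, one possible alternative is to let an XML parser find the tagged spans instead of listing each tag in a regular expression. The following is only a minimal sketch under some assumptions: the function name `xml_to_conll` is made up for illustration, the annotation tags are direct children of `<doc>` and are not nested, tokens are split on whitespace as in the code above, and the whole document is treated as a single sequence (no blank line between sentences).

```python
import xml.etree.ElementTree as ET


def xml_to_conll(xml_text):
    """Return a list of (token, label) pairs in IOB2 format.

    Hypothetical helper: assumes flat (non-nested) annotation tags
    directly under the root element and whitespace tokenization.
    """
    root = ET.fromstring(xml_text)
    rows = []

    def add_outside(text):
        # Text outside any annotation tag is labelled "O".
        for token in (text or "").split():
            rows.append((token, "O"))

    add_outside(root.text)  # text before the first annotation tag
    for child in root:
        label = child.tag.upper()
        # First token of a span gets B-, the remaining tokens get I-.
        for i, token in enumerate((child.text or "").split()):
            prefix = "B" if i == 0 else "I"
            rows.append((token, f"{prefix}-{label}"))
        add_outside(child.tail)  # text between this tag and the next one
    return rows


sample = """<doc>
   Some <tag1>annotated text</tag1> in <tag2>XML</tag2>.
</doc>"""

for token, label in xml_to_conll(sample):
    print(token, label)
```

Because the label comes from the element itself rather than from comparing strings, an untagged segment that happens to have the same text as an entity would not be labelled as that entity. To keep one sentence per line as in the code above, the function could be called once per line wrapped in a root element.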
