How do I convert CoNLL format to a list of sentences?

Question

I have a txt file that is, in theory, in CoNLL format. It looks like this:

a O
nivel B-INDC
de O
la O
columna B-ANAT
anterior I-ANAT
del I-ANAT
acetabulo I-ANAT


existiendo O
minimos B-INDC
cambios B-INDC
edematosos B-DISO
en O
la O
medular B-ANAT
(...)

I need to convert it into a list of sentences, but I haven't found a way to do it. I tried the parser from the conllu library:

from conllu import parse
sentences = parse("location/train_data.txt")

But it raises an error: ParseException: Invalid line format, line must contain either tabs or two spaces.

How can I get this?

["a nivel de la columna anterior del acetabulo", "existiendo minimos cambios edematosos en la medular", ...]

Thanks

Answer 1

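For NLP problems, my first stop is always Hugging Face :D There is a good example that covers your problem: https://huggingface.co/transformers/custom_datasets.ZFC35FDC70D5FC69D2698EZ83A8

There they show a function that reads this kind of file into lists of tokens and tags: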

from pathlib import Path
import re

def read_wnut(file_path):
    file_path = Path(file_path)

    raw_text = file_path.read_text().strip()
    raw_docs = re.split(r'\n\t?\n', raw_text)
    token_docs = []
    tag_docs = []
    for doc in raw_docs:
        tokens = []
        tags = []
        for line in doc.split('\n'):
            token, tag = line.split('\t')
            tokens.append(token)
            tags.append(tag)
        token_docs.append(tokens)
        tag_docs.append(tags)

    return token_docs, tag_docs

texts, tags = read_wnut("location/train_data.txt")
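
Two things are worth checking before applying this to the data in the question: the sample there appears to separate token and tag with a space rather than a tab, and the sentences are separated by more than one blank line. The result is also a list of token lists, not sentence strings. Below is a minimal sketch (not the Hugging Face function itself) that assumes whitespace-separated columns with the token in the first column, and joins the tokens into sentences. The name read_sentences is made up for illustration.

```python
import re


def read_sentences(raw_text):
    """Turn two-column CoNLL-style text into a list of sentence strings.

    Assumes columns are separated by any whitespace and the token is the
    first column; runs of blank lines separate sentences.
    """
    sentences = []
    # Split on one or more blank (or whitespace-only) lines.
    for block in re.split(r"\n\s*\n", raw_text.strip()):
        tokens = [line.split()[0] for line in block.split("\n") if line.strip()]
        if tokens:
            sentences.append(" ".join(tokens))
    return sentences


sample = "a O\nnivel B-INDC\nde O\n\n\nexistiendo O\nminimos B-INDC\n"
print(read_sentences(sample))
```

For a real file, pass in the file contents, e.g. Path("location/train_data.txt").read_text().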

Answer 2

You can use the conllu library.

Install it with pip install conllu.

An example use case is shown below.

>>> from conllu import parse
>>>
>>> data = """
# text = The quick brown fox jumps over the lazy dog.
1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
2   quick   quick  ADJ    JJ   Degree=Pos                  4   amod    _   _
3   brown   brown  ADJ    JJ   Degree=Pos                  4   amod    _   _
4   fox     fox    NOUN   NN   Number=Sing                 5   nsubj   _   _
5   jumps   jump   VERB   VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
6   over    over   ADP    IN   _                           9   case    _   _
7   the     the    DET    DT   Definite=Def|PronType=Art   9   det     _   _
8   lazy    lazy   ADJ    JJ   Degree=Pos                  9   amod    _   _
9   dog     dog    NOUN   NN   Number=Sing                 5   nmod    _   SpaceAfter=No
10  .       .      PUNCT  .    _                           5   punct   _   _

"""
>>> sentences = parse(data)
>>> sentences
[TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, .>]
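
Note that parse() is given the text itself here, not a file path, and the error in the question says each line must contain a tab or two spaces. One possible workaround, sketched below under those assumptions, is to read the file and rewrite the whitespace between columns as tabs before handing the text to the parser. The helper name to_tab_separated is hypothetical, and the fields argument mentioned in the final comment is my recollection of the conllu API for custom column layouts, so please check it against the library's documentation.

```python
def to_tab_separated(raw_text):
    """Rewrite whitespace-separated CoNLL-style lines as tab-separated lines."""
    out = []
    for line in raw_text.splitlines():
        if line.startswith("#"):
            out.append(line)  # leave comment lines untouched
        else:
            # Blank lines become "" and still mark sentence boundaries.
            out.append("\t".join(line.split()))
    return "\n".join(out) + "\n"


sample = "a O\nnivel B-INDC\n\nde O"
print(repr(to_tab_separated(sample)))

# Assumed usage with conllu (not verified here):
#   text = open("location/train_data.txt").read()
#   sentences = parse(to_tab_separated(text), fields=["form", "tag"])
```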

Answer 3

The simplest thing is to loop over the lines of the file and take the first column. No imports needed.

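YOUR_FILE = "location/train_data.txt"  # path to your CoNLL file

result = [[]]
with open(YOUR_FILE, "r") as infile:
    for line in infile:
        if line.startswith("#"):
            continue  # skip CoNLL comment lines
        if line.strip() == "":
            # blank line: start a new sentence unless the current one is still empty
            if result[-1]:
                result.append([])
        else:
            result[-1].append(line.split()[0])  # first column is the token
# join tokens; drop the empty entry left behind by trailing blank lines
result = [" ".join(row) for row in result if row]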

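In my experience, writing these by hand is the most efficient approach. CoNLL formats vary a lot (usually in trivial ways, such as the order of the columns), and you don't want to wrestle with someone else's code for something that is this easy to solve yourself. For example, the code quoted by @markusodenthal will keep CoNLL comments (lines starting with #), which is probably not what you want.

Another point is that writing the loop yourself lets you process the file sentence by sentence instead of reading everything into an array first. If you don't need to process the data as a whole, this will be faster and more scalable.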
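
To illustrate the sentence-by-sentence idea, here is a minimal sketch using a generator. It assumes the same layout as above (token in the first whitespace-separated column, blank lines between sentences, # for comments); the name iter_sentences is just for illustration.

```python
def iter_sentences(lines):
    """Yield one sentence string at a time from an iterable of CoNLL lines."""
    tokens = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        if not line.strip():
            # Blank line: emit the sentence collected so far, if any.
            if tokens:
                yield " ".join(tokens)
                tokens = []
        else:
            tokens.append(line.split()[0])
    if tokens:  # last sentence when the input does not end with a blank line
        yield " ".join(tokens)


sample = ["a O\n", "nivel B-INDC\n", "\n", "\n", "existiendo O\n", "minimos B-INDC\n"]
for sentence in iter_sentences(sample):
    print(sentence)
```

Since an open file is itself an iterable of lines, you can pass the file object directly (with open(YOUR_FILE) as f: for sentence in iter_sentences(f): ...), and only one sentence is held in memory at a time.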

3 answers

Answer 1
1 Accepted 2021-04-07 11:44:03

Answer 2
0 2021-04-07 17:44:20

Answer 3
0 2021-06-12 16:50:00
