
Readlines causing error after many lines?

I'm working on an NER task at the moment, with data from wnut17train.conll (https://github.com/leondz/emerging_entities_17). It's basically a collection of tweets where each line is a single word from the tweet with an IOB tag attached (separated by a \t). Different tweets are separated by a blank line (actually, and weirdly enough if you ask me, a '\t\n' line).

So, for reference, a single tweet would look like this:

@paulwalk    IOBtag
...          ...
foo          IOBtag
[\t\n]
@jerrybeam   IOBtag
...          ...
bar          IOBtag

The goal for this first step is to convert this data set into a training file that looks like this:

train[0] = [(first_word_of_first_tweet, POStag, IOBtag),
            (second_word_of_first_tweet, POStag, IOBtag),
            ...,
            (last_word_of_first_tweet, POStag, IOBtag)]
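
Applied to the sample tweet above, the first entry would look something like this (the POS and IOB values here are placeholders, not actual model output):

train[0] = [("@paulwalk", "PROPN", "O"),
            ...,
            ("foo", "NOUN", "O")]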

This is what I came up with so far:

import spacy

tmp = []
train = []
nlp = spacy.load("en_core_web_sm")
with open("wnut17train.conll") as f:
    for l in f.readlines():
        if l == '\t\n':
            # separator line: the current tweet is complete
            train.append(tmp)
            tmp = []
        else:
            # run spaCy on the word (the first whitespace-separated field)
            doc = nlp(l.split()[0])
            for token in doc:
                tmp.append((token.text, token.pos_, token.ent_iob_))

Everything works smoothly for a certain number of tweets (or lines, I'm not sure yet), but after that I get a

IndexError: list index out of range

raised by

doc = nlp(l.split()[0])
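
For reference, str.split() with no arguments discards all whitespace, so indexing its result fails on any line that is pure whitespace. A minimal reproduction:

>>> "\t\n".split()      # split() drops all whitespace, so nothing is left
[]
>>> "\t\n".split()[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range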

The first time I got it around line 20'000 (20'533 to be precise). After checking that this was not due to the file (maybe a different way of separating tweets, or something like that which might have tricked the parser), I removed the first 20'000 lines and tried again. Again, I got an error after around 20'000 lines (20'260, or 40'779 in the original file, to be precise).

I did some research on readlines() to see if this was a known problem, but it doesn't look like it is. Am I missing something?

I used the wnut17train.conll file from https://github.com/leondz/emerging_entities_17 and ran similar code to generate your required output. I found that some of the separator lines contain only "\n" instead of "\t\n".
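
One quick way to confirm this (a sketch, assuming wnut17train.conll is in the working directory) is to print every whitespace-only line that is not exactly "\t\n":

# Sketch: flag separator lines that deviate from the expected "\t\n".
# A line with no non-whitespace content is what breaks l.split()[0].
with open("wnut17train.conll") as f:
    for i, l in enumerate(f, start=1):
        if not l.split() and l != "\t\n":
            print(i, repr(l))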

Because of this, l.split()[0] raises an IndexError: list index out of range (split() on a pure-whitespace line returns an empty list). To handle this we can also treat any line of length 1 as a separator, and in that case too append tmp to train.

import spacy

nlp = spacy.load("en_core_web_sm")
train = []
tmp = []
with open("wnut17train.conll") as fp:
    for l in fp.readlines():
        if l == "\t\n" or len(l) == 1:
            # separator line: either the usual "\t\n" or a bare "\n"
            train.append(tmp)
            tmp = []
        else:
            cols = l.split("\t")
            doc = nlp(cols[0])
            for token in doc:
                # strip() removes the trailing newline from the IOB tag
                tmp.append((cols[0], token.pos_, cols[1].strip()))
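
As a quick sanity check after running this (a sketch; the actual POS values depend on the spaCy model):

print(len(train))    # number of tweets parsed
print(train[0][:3])  # e.g. [('@paulwalk', 'PROPN', 'O'), ...] (illustrative)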

Hope your question is resolved.
