
Read random sentences from an NLTK corpus

I'm working on my university project, and I have to read 50 random sentences from an NLTK corpus (SemCor).

Currently I am only able to read the first 50 sentences, as follows:

import nltk
from nltk.corpus import semcor as corpus

def get_sentence_from_semcor(sentence_num):
    sentence = " ".join(corpus.sents()[sentence_num])
    tags = corpus.tagged_sents(tag="sem")[sentence_num]
    for curr_word in range(len(tags)):
        # keep the last semantically tagged word, its synset, and the sentence with that word removed
        if isinstance(tags[curr_word], nltk.Tree) and isinstance(tags[curr_word][0], str) and isinstance(tags[curr_word].label(), nltk.corpus.reader.wordnet.Lemma):
            word = tags[curr_word][0]
            target = tags[curr_word].label().synset()
            sentence_no_word = sentence.replace(word, "")
    return word, sentence_no_word, target

corpus_sentences = [get_sentence_from_semcor(i) for i in range(50)]

Any help on how I could select 50 sentences of the corpus at random?

Well, you want randomness, so let's import the random library:

import random

Then we need to know what our constraints are. Obviously, the earliest sentence we can select is sentence 1, i.e. the sentence at index 0; to know the maximum, we need to count the number of sentences and then subtract 1 to get the index of the last sentence.

max_sentence = len(corpus.sents())-1

We'll create an empty list to store our [pseudo]random numbers in:

list_of_random_indexes = []

then get some numbers in it (50 of them in this case):

for i in range(50):
    list_of_random_indexes.append(random.randint(0, max_sentence))

Then finish off with a modified version of your last line which now references our list of random numbers instead of the range:

corpus_sentences = [get_sentence_from_semcor(i) for i in list_of_random_indexes]

So all together:

import random
max_sentence = len(corpus.sents())-1
list_of_random_indexes = []
for i in range(50):
    list_of_random_indexes.append(random.randint(0, max_sentence))
corpus_sentences = [get_sentence_from_semcor(i) for i in list_of_random_indexes]

Or to make that a bit cleaner:

import random
max_sentence = len(corpus.sents())-1
list_of_random_indexes = [random.randint(0, max_sentence) for i in range(50)]
corpus_sentences = [get_sentence_from_semcor(i) for i in list_of_random_indexes]

But since you probably don't want duplicate sentences, I would also check, before appending an index, that it isn't already in the list.

import random
max_sentence = len(corpus.sents())-1
list_of_random_indexes = []
while len(list_of_random_indexes)<50:
    test_index = random.randint(0, max_sentence)
    if test_index not in list_of_random_indexes:
        list_of_random_indexes.append(test_index)
corpus_sentences = [get_sentence_from_semcor(i) for i in list_of_random_indexes]
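
Alternatively, random.sample can draw 50 distinct indexes in a single call, so no duplicate check is needed; a minimal sketch, assuming get_sentence_from_semcor is defined as in the question:

import random
from nltk.corpus import semcor as corpus

# sample 50 distinct sentence indexes in one call; no manual duplicate check needed
num_sentences = len(corpus.sents())
list_of_random_indexes = random.sample(range(num_sentences), 50)
corpus_sentences = [get_sentence_from_semcor(i) for i in list_of_random_indexes]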

You can try something like this:

import numpy as np

length = len(nltk.corpus.semcor.sents()) - 50
for i in range(n_times):  # n_times: how many batches of 50 sentences you want
    start = np.random.randint(0, length)
    corpus_sentences = [get_sentence_from_semcor(i) for i in range(start, start + 50)]

The code will iterate n_times, returning a set of 50 consecutive sentences each time; 'start' is a random integer in range(0, length) (assuming that you know the total length of the corpus).
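
Note that this selects a block of 50 consecutive sentences starting at a random position. If you want 50 sentences scattered across the whole corpus instead, numpy can draw distinct indexes without replacement; a minimal sketch, again assuming the question's get_sentence_from_semcor:

import nltk
import numpy as np

num_sentences = len(nltk.corpus.semcor.sents())
# draw 50 distinct sentence indexes anywhere in the corpus
random_indexes = np.random.choice(num_sentences, size=50, replace=False)
corpus_sentences = [get_sentence_from_semcor(int(i)) for i in random_indexes]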
