
Read random sentences from an NLTK corpus

I'm working on a university project and I have to read 50 random sentences from an NLTK corpus (SemCor).

Currently I am only able to read the first 50 sentences, as follows:

import nltk
from nltk.corpus import semcor as corpus

def get_sentence_from_semcor(sentence_num):
    sentence = " ".join(corpus.sents()[sentence_num])
    tags = corpus.tagged_sents(tag="sem")[sentence_num]
    for curr_word in range(len(tags)):
        if isinstance(tags[curr_word], nltk.Tree) and isinstance(tags[curr_word][0], str) and isinstance(tags[curr_word].label(), nltk.corpus.reader.wordnet.Lemma):
            word = tags[curr_word][0]
            target = tags[curr_word].label().synset()
            sentence_no_word = sentence.replace(word, "")
    return word, sentence_no_word, target

corpus_sentences = [get_sentence_from_semcor(i) for i in range(50)]

Any help on how I could randomly select 50 sentences from the corpus?

Well, you want randomness, so let's import the random library:

import random

Then we need to know what our constraints are. Obviously, the earliest sentence we can select is sentence 1, i.e. the sentence at index 0. To find the maximum, we count the number of sentences and subtract 1 to get the index of the last sentence.

max_sentence = len(corpus.sents())-1
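
As a small aside on those bounds: random.randint includes both endpoints, which is why 1 is subtracted above, whereas random.randrange excludes the upper one. Here is a minimal sketch of the difference. Note that total_sentences is a made-up number standing in for len(corpus.sents()), not the real SemCor size.

```python
import random

# Hypothetical stand-in for len(corpus.sents()); not the real SemCor size.
total_sentences = 1000

rng = random.Random(0)  # seeded only so the sketch is repeatable

# randint(a, b) can return both a and b, so the top bound is the last valid index.
inclusive_draws = [rng.randint(0, total_sentences - 1) for _ in range(10_000)]

# randrange(n) returns values from 0 to n - 1, so no "- 1" is needed.
exclusive_draws = [rng.randrange(total_sentences) for _ in range(10_000)]
```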

We'll create an empty list to store our [pseudo]random numbers in:

list_of_random_indexes = []

Then put some numbers in it (50 of them in this case):

for i in range(50):
    list_of_random_indexes.append(random.randint(0, max_sentence))

Then finish off with a modified version of your last line which now references our list of random numbers instead of the range:

corpus_sentences = [get_sentence_from_semcor(i) for i in list_of_random_indexes]

So all together:

import random
max_sentence = len(corpus.sents())-1
list_of_random_indexes = []
for i in range(50):
    list_of_random_indexes.append(random.randint(0, max_sentence))
corpus_sentences = [get_sentence_from_semcor(i) for i in list_of_random_indexes]

Or to make that a bit cleaner:

import random
max_sentence = len(corpus.sents())-1
list_of_random_indexes = [random.randint(0, max_sentence) for _ in range(50)]
corpus_sentences = [get_sentence_from_semcor(i) for i in list_of_random_indexes]

But since you probably don't want duplicate sentences, I would also check that an index isn't already in the list before appending it:

import random
max_sentence = len(corpus.sents())-1
list_of_random_indexes = []
while len(list_of_random_indexes)<50:
    test_index = random.randint(0, max_sentence)
    if test_index not in list_of_random_indexes:
        list_of_random_indexes.append(test_index)
corpus_sentences = [get_sentence_from_semcor(i) for i in list_of_random_indexes]
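
If you'd rather not write the duplicate check by hand, the standard library's random.sample draws without replacement, so it should give the same "50 distinct indexes" result in one call. This is a sketch rather than a drop-in: pick_unique_indexes is a hypothetical helper name, and the total of 1000 is a placeholder for len(corpus.sents()).

```python
import random

def pick_unique_indexes(total, how_many, seed=None):
    """Return how_many distinct indexes drawn from range(total)."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no index repeats.
    # It raises ValueError if how_many is larger than total.
    return rng.sample(range(total), how_many)

# Hypothetical corpus size, standing in for len(corpus.sents()).
example_indexes = pick_unique_indexes(total=1000, how_many=50, seed=42)
```

With the real corpus you would then pass the result to the same list comprehension as above, e.g. [get_sentence_from_semcor(i) for i in pick_unique_indexes(len(corpus.sents()), 50)].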

You can try something like this:

import nltk
import numpy as np

n_times = 3  # how many batches to draw; this value is only an example
length = len(nltk.corpus.semcor.sents()) - 50
for _ in range(n_times):
    start = np.random.randint(0, length)
    corpus_sentences = [get_sentence_from_semcor(i) for i in range(start, start + 50)]

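The code iterates n_times, producing a block of 50 consecutive sentences each time, beginning at a random position. 'start' is a random integer in range(0, length), so start + 50 never runs past the end of the corpus (this assumes you know the total length of the corpus). Note that corpus_sentences is overwritten on every iteration, so use or store each batch inside the loop.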
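
The same "random starting point, then 50 in a row" idea can be sketched with only the standard library. Everything named here is an assumption for illustration: random_window is a hypothetical helper, and 1000 stands in for the real corpus length.

```python
import random

def random_window(total, size, rng=random):
    """Return a range of `size` consecutive indexes that fits inside range(total)."""
    if size > total:
        raise ValueError("window is larger than the corpus")
    # randrange's upper bound is exclusive, so the largest start is total - size.
    start = rng.randrange(total - size + 1)
    return range(start, start + size)

# Hypothetical corpus size; seeded generator only so the sketch is repeatable.
window = random_window(total=1000, size=50, rng=random.Random(7))
```

One small difference from the snippet above: this version lets the window reach the very last sentence, because the largest allowed start is total - size.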


 