Selecting a random sentence less than 280 characters from a text file

I am working on a project where I want to read a large text file and randomly select a full sentence from it. If that sentence is fewer than 280 characters, print it; if not, select another sentence, and keep going until it finds one that is fewer than 280 characters. Using nltk I am able to break the text down into individual sentences, select one at random, and count its characters.

import nltk.data
import random

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

with open("test.txt") as fp:
    data = fp.read()

tok = tokenizer.tokenize(data)  # breaks the text into sentences
newTok = random.choice(tok)     # selects a random sentence
length = len(newTok)            # number of characters in that sentence

I am now trying to create a while loop that tests whether a sentence is less than 280 characters and prints it if so, and otherwise randomly selects another sentence to test:

while length < 280:  # while length of sentence is less than 280

      print "length of sentence = ", length # do this 
      print newTok # do this 
      break #stops loop

      else: 
          print length, " is too long" 

But this gives me an invalid syntax error on the else, and I also think it will not iterate again to find another sentence.

Any suggestions would be great.

After getting the list of tokens:

tok = tokenizer.tokenize(data)  # breaks the text into sentences

...the rest is a one-liner:

newTok = random.choice([x for x in tok if len(x)<280])

Note the use of a list comprehension with an if clause to narrow the token list down to the items whose length is less than 280 characters before choosing one at random.
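One caveat with the one-liner: random.choice raises IndexError when given an empty list, which happens if no sentence in the file is short enough. The following is a minimal sketch, written for Python 3, of how the filtering could be wrapped to handle that case, together with a loop-based variant that is closer to what the question attempted. The function names, the max_len and max_tries parameters, and the sample list are hypothetical, chosen only for illustration; in practice the list would be the tok produced by the tokenizer.

```python
import random


def pick_short_sentence(sentences, max_len=280):
    """Return a random sentence shorter than max_len, or None if none qualifies."""
    # Same filter as the one-liner: keep only sentences under the limit.
    candidates = [s for s in sentences if len(s) < max_len]
    if not candidates:
        # random.choice would raise IndexError on an empty list.
        return None
    return random.choice(candidates)


def pick_short_sentence_retry(sentences, max_len=280, max_tries=1000):
    """Loop-based variant: redraw until a sentence fits or the tries run out."""
    if not sentences:
        return None
    for _ in range(max_tries):
        candidate = random.choice(sentences)
        if len(candidate) < max_len:
            return candidate
    # Give up rather than loop forever when every sentence is too long.
    return None


# Hypothetical stand-in for the tokenizer output.
sample = ["Short one.", "x" * 300, "Another short sentence."]
print(pick_short_sentence(sample))
```

Filtering first means a single random draw is always enough, whereas the retry loop needs an upper bound on attempts so that it cannot spin indefinitely on a file containing only long sentences.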
