How to make this random text generator more efficient in Python?

I'm working on a random text generator (without using Markov chains), and it currently works without too many problems. First, here is the flow of my code:

  1. Enter a sentence as input. This is called the trigger string and is assigned to a variable.

  2. Get the longest word in the trigger string.

  3. Search the whole Project Gutenberg database for sentences that contain this word, ignoring case.

  4. Return the longest sentence that contains the word from step 3.

  5. Append the sentences from step 1 and step 4 together.

  6. Assign the sentence from step 4 as the new trigger sentence and repeat the process. Note that I then have to get the longest word in the second sentence, and so on.

And here is my code:

import nltk
from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str
longestLength = 0
longestString = ""
listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of  list format-
listOfWords = gutenberg.words()# all words in gutenberg books -list format-

while triggerSentence:
    #so this is run every time through the loop
    split_str = triggerSentence.split()#split the sentence into words

    #code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)

    #code to get the sentences containing the longest word, then selecting
    #random one of these sentences that are longer than 40 characters
    sets = []
    for sentence in listOfSents:
        if sentence.count(longestString):
            sents= " ".join(sentence)
            if len(sents) > 40:
                sets.append(sents)

    triggerSentence = choice(sets)
    print triggerSentence

My concern is that the loop usually reaches a point where the same sentence is printed over and over again, since it is the longest sentence that contains the longest word. To avoid getting the same sentence repeatedly, I thought of the following:

* If the longest word in the current sentence is the same as in the previous sentence, simply delete this word from the current sentence and look for the next longest word.

I tried some implementations of this but failed to apply the solution above, since it involves lists and lists of lists (because of the words and sentences from the gutenberg module). Any suggestions on how to find the second longest word? I can't seem to do this by parsing a simple string input, since the .sents() and .words() functions of NLTK's Gutenberg module yield a list of lists and a list, respectively. Thanks in advance.

Some suggested improvements:

  1. The while loop will run forever; you should probably remove it.
  2. Use max and generator expressions to find the longest word in a memory-efficient manner.
  3. Use a list comprehension to build the list of sentences that are longer than 40 characters and contain longestWord. This should also be moved out of the while loop, as it only needs to happen once.

    sents = [" ".join(sent) for sent in listOfSents if longestWord in sent and len(" ".join(sent)) > 40]

  4. If you want to print every sentence that was found in a random order, you could shuffle the list you just created. Note that random.shuffle works in place and returns None, so shuffle first and then iterate over the list:

    random.shuffle(sents)
    for sent in sents: print sent

This is how the code could look with these changes:

import nltk
from nltk.corpus import gutenberg
from random import shuffle

listOfSents = gutenberg.sents()
triggerSentence = raw_input("Please enter the trigger sentence: ")

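longestWord = max(triggerSentence.split(), key=len)
# Keep only sentences that contain the word and are longer than 40 characters.
longSents = [" ".join(sent) for sent in listOfSents
             if longestWord in sent
             and len(" ".join(sent)) > 40]

# shuffle() works in place and returns None, so shuffle first, then iterate.
shuffle(longSents)
for sent in longSents:
    print sent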
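
The question also asks how to fall back to the second longest word when the longest one keeps producing the same sentence. One possible approach, sketched below in Python 3, is to sort the trigger's words by length, skip any word that has already been used, and look sentences up in a word index that is built once instead of rescanning the corpus on every pass. This is only an illustration under stated assumptions: `CORPUS` is a tiny hypothetical stand-in for `gutenberg.sents()`, and `build_index`, `next_sentence`, and `generate` are names invented for this example rather than part of NLTK.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for gutenberg.sents(): a list of token lists.
CORPUS = [
    ["The", "whale", "surfaced", "unexpectedly", "beside", "the", "ship", "."],
    ["Nobody", "had", "unexpectedly", "remembered", "the", "extraordinary", "lantern", "."],
    ["An", "extraordinary", "silence", "followed", "the", "lantern", "."],
    ["The", "lantern", "flickered", "."],
]


def build_index(sentences):
    """Map each lowercased token to the indices of the sentences containing it."""
    index = defaultdict(set)
    for i, tokens in enumerate(sentences):
        for token in tokens:
            index[token.lower()].add(i)
    return index


def next_sentence(trigger, sentences, index, used_words, min_chars=40, rng=random):
    """Return (word, sentence) for the longest not-yet-used word, or None."""
    # Longest words first; sorted() is stable, so ties keep their input order.
    for word in sorted(trigger.lower().split(), key=len, reverse=True):
        if word in used_words:
            continue  # already tried: fall through to the next longest word
        matches = [" ".join(sentences[i]) for i in sorted(index.get(word, ()))]
        matches = [s for s in matches if len(s) > min_chars and s != trigger]
        if matches:
            used_words.add(word)
            return word, rng.choice(matches)
    return None


def generate(trigger, sentences, steps=5, min_chars=40, rng=random):
    """Chain up to `steps` sentences, each triggered by the previous one."""
    index = build_index(sentences)  # built once, outside the loop
    used_words = set()
    output = [trigger]
    for _ in range(steps):
        result = next_sentence(output[-1], sentences, index, used_words, min_chars, rng)
        if result is None:
            break  # no unused word leads to a new sentence
        output.append(result[1])
    return output


if __name__ == "__main__":
    for line in generate("extraordinary lantern", CORPUS):
        print(line)
```

Because every word is used at most once and the number of steps is bounded, the loop cannot get stuck printing the same sentence. The sketch splits the trigger on whitespace only, so a word typed with attached punctuation (for example "lantern,") would not match a corpus token.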

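If all you need is to generate random text (which, I suppose, is required to consist of meaningful sentences), you can do it much more simply: just generate random numbers and use them as indices to retrieve sentences from a text database (whether Project Gutenberg or anything else you like).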
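
A minimal Python 3 sketch of that idea follows. It assumes the sentences are available as an indexable sequence of token lists; `SENTENCES` is a small hypothetical stand-in, and `random_text` is a name made up for this example. The sequence returned by `gutenberg.sents()` should be usable in its place in principle, though that is not exercised here.

```python
import random

# Hypothetical stand-in for a corpus of tokenized sentences.
SENTENCES = [
    ["Call", "me", "Ishmael", "."],
    ["It", "was", "a", "bright", "cold", "day", "."],
    ["The", "lantern", "flickered", "."],
]


def random_text(sentences, count, rng=random):
    """Join `count` sentences picked by random index (repeats are possible)."""
    picked = [sentences[rng.randrange(len(sentences))] for _ in range(count)]
    return " ".join(" ".join(tokens) for tokens in picked)


if __name__ == "__main__":
    print(random_text(SENTENCES, 3))
```

If repeated sentences are unwanted, `random.sample(range(len(sentences)), count)` could supply distinct indices instead, as long as `count` does not exceed the number of sentences.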
