How to generate a meaningful sentence from words only?

I want to generate a sentence from a list of words. I have tried an n-gram model, but it only continues existing text: given an input sentence, it predicts the next words based on the value of n. Which model would help generate a meaningful sentence from only a list of words, and which dataset should be used to train it?

The dataset: just take a dataset consisting of sentences. Tokenize each sentence and shuffle the tokens. The shuffled tokens are your input, and the original sentence is the output. This way you can generate as many samples as you wish:

import nltk
from random import shuffle

def create_input(sentence):
    # Tokenize the sentence, then shuffle the tokens: the shuffled
    # tokens become the model input, the original sentence the target.
    tokens = nltk.word_tokenize(sentence)
    shuffle(tokens)
    return tokens
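
For illustration, a minimal sketch of turning a small corpus into (shuffled tokens, sentence) pairs with the function above (the sample sentences are made up):

sentences = [
    "The dog barked loudly.",
    "She walked to school in the rain.",
]

# Space-joined shuffled tokens as the source, the original sentence
# as the target for a sequence-to-sequence model.
pairs = [(" ".join(create_input(s)), s) for s in sentences]
print(pairs[0])  # e.g. ('loudly barked The . dog', 'The dog barked loudly.')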

More difficult is the model: you could try to fine-tune a BERT model, and I guess it would probably work.
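
Since generation needs a decoder, an encoder-decoder model is the more natural fit here; below is a minimal sketch of one fine-tuning step on such a pair, with t5-base swapped in as an assumption (the answer itself names BERT):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One (shuffled tokens -> original sentence) pair from the scheme above.
src = "loudly barked The . dog"
tgt = "The dog barked loudly."

batch = tokenizer(src, return_tensors="pt")
labels = tokenizer(tgt, return_tensors="pt").input_ids

loss = model(**batch, labels=labels).loss  # seq2seq cross-entropy
loss.backward()
optimizer.step()
optimizer.zero_grad()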

What you want is called lexically constrained beam search in the natural language generation literature. First install the latest transformers:

pip install -q git+https://github.com/huggingface/transformers.git

Then this code can generate a sentence containing the forced words:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

encoder_input_str = "Generate a sentence:"

# Words that must appear somewhere in the generated output.
force_words = ["I", "school"]

input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids
force_words_ids = tokenizer(force_words, add_special_tokens=False).input_ids

outputs = model.generate(
    input_ids,
    force_words_ids=force_words_ids,  # triggers constrained beam search
    num_beams=5,
    num_return_sequences=1,
    no_repeat_ngram_size=1,
    remove_invalid_values=True,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

For further information, refer to the Hugging Face documentation on constrained beam search.

If you don't want to use deep learning: index a large collection of sentences, search for the keywords with a retrieval system such as Lucene, and return the sentence closest to your query.
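
A minimal sketch of that idea in Python, using the rank_bm25 package as a lightweight stand-in for Lucene (the package choice and the tiny corpus are assumptions for illustration):

from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "I walked to school this morning.",
    "The lantern cast a strange shape on her hair.",
    "Alligators like to dig through dirt and earth.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# The keyword list acts as the query; the best-scoring sentence wins.
query = ["shape", "lantern", "hair"]
best = bm25.get_top_n(query, corpus, n=1)[0]
print(best)  # "The lantern cast a strange shape on her hair."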

You can use GPT-J. It is a free GPT model, and its performance is comparable to GPT-3. The model takes the input you provide and tries to complete it.

How I use GPT-J to generate a sentence from a set of keywords:

Input:

Make a sentence with the following words: earth, dirt, alligator
Sentence: While the alligator is a species which mainly lives in the water, the earth is not uncommon territory and they like to dig through the dirt.

Make a sentence with the following words: shape, lantern, hair
Sentence: 

Output:

Make a sentence with the following words: earth, dirt, alligator
Sentence: While the alligator is a species which mainly lives in the water, the earth is not uncommon territory and they like to dig through the dirt.

Make a sentence with the following words: shape, lantern, hair
Sentence: The hair is so thick on the lantern that it is almost like a shape.

How to tweak it for a certain use case?

Giving an example of what you want in the input (example keywords + sentence) helps GPT understand the structure of the desired output. In my experience, explicitly stating the desired task in the input ("Make a sentence...") also helps it understand what to do.

You can change the complexity of the output sentence by changing the example sentence to something like: An alligator likes to dig dirt out of the earth.

How to use it?

Git repo: https://github.com/kingoflolz/mesh-transformer-jax

As shown in the repo, you can use the model's web demo for testing, or run it yourself in Colab.

Web demo: https://6b.eleuther.ai/

Colab notebook: http://colab.research.google.com/github/kingoflolz/mesh-transformer-jax/blob/master/colab_demo.ipynb

I do not recommend trying to run it locally.
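
For completeness, the few-shot call looks roughly like this through the Hugging Face checkpoint (a sketch; the checkpoint name EleutherAI/gpt-j-6B and the sampling settings are assumptions, and the 6B weights need a large GPU, hence the demo/Colab recommendation above):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# The same few-shot prompt as above: one worked example, then the
# keywords we actually want a sentence for.
prompt = (
    "Make a sentence with the following words: earth, dirt, alligator\n"
    "Sentence: While the alligator is a species which mainly lives in "
    "the water, the earth is not uncommon territory and they like to "
    "dig through the dirt.\n\n"
    "Make a sentence with the following words: shape, lantern, hair\n"
    "Sentence:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))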

Thanks to text generation models like GPT-3, GPT-J, and GPT-NeoX, you can generate content out of simple keywords.

For example, let's say you want to generate a product description out of a couple of keywords. You could use few-shot learning and do something like this:

Generate a product description out of keywords.

Keywords: shoes, women, $59
Sentence: Beautiful shoes for women at the price of $59.
###
Keywords: trousers, men, $69
Sentence: Modern trousers for men, for $69 only.
###
Keywords: gloves, winter, $19
Sentence: Amazingly hot gloves for cold winters, at $19.
###
Keywords: t-shirt, men, $39
Sentence:
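
A small sketch of assembling such a few-shot prompt programmatically (the helper below is hypothetical, just to show the pattern); when calling the model, pass "###" as the stop sequence so generation ends after a single description:

# Hypothetical helper that rebuilds the prompt shown above.
EXAMPLES = [
    ("shoes, women, $59", "Beautiful shoes for women at the price of $59."),
    ("trousers, men, $69", "Modern trousers for men, for $69 only."),
    ("gloves, winter, $19", "Amazingly hot gloves for cold winters, at $19."),
]

def build_prompt(keywords):
    header = "Generate a product description out of keywords.\n\n"
    shots = "".join(f"Keywords: {k}\nSentence: {s}\n###\n" for k, s in EXAMPLES)
    return header + shots + f"Keywords: {keywords}\nSentence:"

print(build_prompt("t-shirt, men, $39"))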

I actually wrote an article about this that you might find useful: effectively using GPT-J with few-shot learning.
