NLP with fewer than 20 words on Google Cloud
According to this documentation, the classifyText method requires at least 20 words:
https://cloud.google.com/natural-language/docs/classifying-text#language-classify-content-nodejs
If I send in fewer than 20 words, I get this error no matter how clear the content is:
Invalid text content: too few tokens (words) to process.
I'm looking for a way to force classification without disrupting the NLP too much. Are there neutral-vector words that can be appended to short phrases so that classifyText will process them anyway?
Example:
async function quickstart() {
  const language = require('@google-cloud/language');
  const client = new language.LanguageServiceClient();

  // Fewer than 20 words. What if I append some other neutral words,
  // e.g. a, of, it, to? Or would it be better to repeat the phrase?
  const text = 'The Atlanta Braves is the best team.';

  const document = {
    content: text,
    type: 'PLAIN_TEXT',
  };

  const [classification] = await client.classifyText({document});
  console.log('Categories:');
  classification.categories.forEach(category => {
    console.log(`Name: ${category.name}, Confidence: ${category.confidence}`);
  });
}

quickstart();
The problem with this is that you're adding bias no matter what kind of text you send.
Your only chance is to pad your string up to the minimum word count with empty words that will be filtered out by the preprocessor and tokenizer before they reach the neural network.
I would try adding a suffix to the end of the sentence made up only of stopwords from NLTK, like this:
document.content += ". and ourselves as herself for each all above into through nor me and then by doing"
Why at the end? Because text usually carries more of its information at the beginning.
In case Google does not filter stopwords behind the scenes (which I doubt), this would add just white noise where the network has no focus or attention.
Remember: DO NOT add this string when you already have enough words, because you are billed in 1K-character blocks before anything is filtered.
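To make the "pad only when needed" rule concrete, here is a minimal sketch. The helper names (`countWords`, `padToMinWords`) and the filler list are hypothetical, not part of the Google client library, and the whitespace-based word count is only an approximation of however the service counts tokens, so you may want a small safety margin above 20.

```javascript
// Hypothetical helpers for illustration; not part of @google-cloud/language.
const MIN_WORDS = 20; // minimum stated in the classifyText documentation

// Filler taken from the stopword suffix suggested above.
const FILLER = ['and', 'ourselves', 'as', 'herself', 'for', 'each', 'all', 'above',
  'into', 'through', 'nor', 'me', 'and', 'then', 'by', 'doing'];

function countWords(text) {
  // Rough whitespace split; the service's tokenizer may count differently.
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

function padToMinWords(text, minWords = MIN_WORDS) {
  const missing = minWords - countWords(text);
  if (missing <= 0) {
    return text; // long enough already: do not add billable characters
  }
  // Cycle through the filler list until the gap is closed.
  const filler = [];
  for (let i = 0; i < missing; i++) {
    filler.push(FILLER[i % FILLER.length]);
  }
  // Append the filler at the end, after the original sentence.
  const base = text.trim();
  const separator = /[.!?]$/.test(base) ? ' ' : '. ';
  return base + separator + filler.join(' ');
}
```

In the question's code this would be used as `content: padToMinWords(text)` when building the document, so texts that already meet the minimum are sent unchanged.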
I would also add that suffix to the sentences in your train/test/validation set that have fewer than 20 words and see how it works. The network should learn to ignore the whole appended sentence.