
NLP with less than 20 words on Google Cloud

According to this documentation, the classifyText method requires at least 20 words:

https://cloud.google.com/natural-language/docs/classifying-text#language-classify-content-nodejs

If I send in fewer than 20 words, I get this error no matter how clear the content is:

Invalid text content: too few tokens (words) to process.

I'm looking for a way to work around this without disrupting the NLP too much. Are there neutral filler words that could be appended to short phrases so that classifyText processes them anyway?

Example:

async function quickstart() {
    const language = require('@google-cloud/language');

    const client = new language.LanguageServiceClient();

    // Fewer than 20 words. What if I append some other neutral words
    // (e.g. a, of, it, to) -- or would it be better to repeat the phrase?
    const text = 'The Atlanta Braves is the best team.';

    const document = {
        content: text,
        type: 'PLAIN_TEXT',
    };

    const [classification] = await client.classifyText({document});
    console.log('Categories:');
    classification.categories.forEach(category => {
        console.log(`Name: ${category.name}, Confidence: ${category.confidence}`);
    });
}

quickstart();

The problem with this is that you're adding bias no matter what kind of text you send.

Your only chance is to fill your string up to the minimum word limit with empty words that will be filtered out by the preprocessor and tokenizer before they reach the neural network.

I would try appending a suffix made up of just NLTK stopwords to the end of the sentence, like this:

document.content += ". and ourselves as herself for each all above into through nor me and then by doing"

Why the end? Because text usually carries more information at the beginning.

In case Google does not filter stopwords behind the scenes (which I doubt), this would add only white noise where the network has no focus or attention.

Remember: do NOT add this string when you already have enough words, because you are billed per 1K-character block before any filtering happens.

I would also add that string suffix to sentences in your train/test/validation set that have fewer than 20 words and see how it works. The network should learn to ignore the appended sentence.
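The advice above can be sketched as a small helper. Note that `padToMinWords`, the `MIN_WORDS` constant, and the whitespace-based word count are all assumptions for illustration, not part of the Google client library; the API's own tokenizer may count words differently.

```javascript
// Hypothetical helper: pad short texts with the stopword suffix so
// classifyText accepts them, while leaving longer texts untouched
// (padding long texts would only inflate the billable character count).
const MIN_WORDS = 20;
const STOPWORD_SUFFIX =
    '. and ourselves as herself for each all above into through nor me and then by doing';

function padToMinWords(text) {
    // Count whitespace-separated tokens as a rough proxy for the API's word count.
    const wordCount = text.trim().split(/\s+/).filter(Boolean).length;
    return wordCount >= MIN_WORDS ? text : text + STOPWORD_SUFFIX;
}
```

Usage would then be `document.content = padToMinWords(text);` before calling `client.classifyText({document})`.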

