
How can I use SharpNLP to detect the possibility that a line of text is a sentence?

I've written a small C# program that compiles a bunch of words into a line of text, and I want to use NLP only to give me a percentage possibility that the bunch of words is a sentence. I don't need tokens or tagging; all of that can happen in the background if it needs to be done. I have OpenNLP and SharpEntropy referenced in my project, but when using them I get the error "Array dimensions exceeded supported range." So I've also tried the IKVM-built OpenNLP without SharpEntropy, but without documentation I can't seem to wrap my head around the proper steps to get just the percentage probability.

Any help or direction would be appreciated.

I'll recommend 2 relatively simple measures that might help you classify a word sequence as sentence/non-sentence. Unfortunately, I don't know how well SharpNLP will handle either. More complete toolkits exist in Java, Python, and C++ (LingPipe, Stanford CoreNLP, GATE, NLTK, OpenGRM, ...).

Language-model probability: Train a language model on sentences with start and stop tokens at the beginning/end of the sentence. Compute the probability of your target sequence per that language model. Grammatical and/or semantically sensible word sequences will score much higher than random word sequences. This approach should work with a standard n-gram model, a discriminative conditional probability model, or pretty much any other language modeling approach. But definitely start with a basic n-gram model.
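To make that idea concrete, below is a minimal, self-contained C# sketch of a bigram language model with start/stop padding and add-one smoothing. It is not SharpNLP or OpenNLP code; the class and method names (BigramLanguageModel, Train, ScoreSequence) are just illustrative, and you would need to supply your own corpus of real sentences to train on. After training, a higher (less negative) average per-token log probability suggests the word string is more sentence-like, and you could calibrate a threshold or map the score through a sigmoid if you want something that reads like a percentage.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Minimal bigram language model with add-one smoothing. Illustrative only:
    // train it on a reasonably large corpus of real sentences for useful scores.
    class BigramLanguageModel
    {
        private const string Start = "<s>";
        private const string Stop = "</s>";

        private readonly Dictionary<string, int> unigramCounts = new Dictionary<string, int>();
        private readonly Dictionary<(string, string), int> bigramCounts = new Dictionary<(string, string), int>();
        private readonly HashSet<string> vocabulary = new HashSet<string>();

        public void Train(IEnumerable<string[]> sentences)
        {
            foreach (var sentence in sentences)
            {
                // Pad each training sentence with start/stop tokens.
                var tokens = new[] { Start }.Concat(sentence).Concat(new[] { Stop }).ToArray();
                for (int i = 0; i < tokens.Length; i++)
                {
                    vocabulary.Add(tokens[i]);
                    unigramCounts[tokens[i]] = unigramCounts.GetValueOrDefault(tokens[i]) + 1;
                    if (i > 0)
                    {
                        var bigram = (tokens[i - 1], tokens[i]);
                        bigramCounts[bigram] = bigramCounts.GetValueOrDefault(bigram) + 1;
                    }
                }
            }
        }

        // Average per-token log probability of the padded sequence.
        // Higher (closer to zero) means more sentence-like under the model.
        public double ScoreSequence(string[] words)
        {
            var tokens = new[] { Start }.Concat(words).Concat(new[] { Stop }).ToArray();
            double logProb = 0.0;
            for (int i = 1; i < tokens.Length; i++)
            {
                int bigramCount = bigramCounts.GetValueOrDefault((tokens[i - 1], tokens[i]));
                int contextCount = unigramCounts.GetValueOrDefault(tokens[i - 1]);
                // Add-one smoothing so unseen bigrams still get a small probability.
                double p = (bigramCount + 1.0) / (contextCount + vocabulary.Count);
                logProb += Math.Log(p);
            }
            return logProb / (tokens.Length - 1);
        }
    }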

Parse tree probability: Similarly, you can measure the inside probability of recovered constituency structure (e.g., via a probabilistic context-free grammar parse). More grammatical sequences (i.e., more likely to be a complete sentence) will be reflected in higher inside probabilities. You will probably get better results if you normalize by the sequence length (the same may apply to a language-modeling approach as well).
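If you want to experiment with the parse-tree idea without a full parser, here is a rough C# sketch of the inside (CKY-style) probability computation for a toy PCFG in Chomsky normal form. Everything here is illustrative: the rule representation and class names are made up, and a real grammar would come from a treebank-trained parser such as those in the toolkits above. The raw inside probability shrinks quickly with sentence length, so in practice compare log values normalized by the number of words, as suggested above.

    using System;
    using System.Collections.Generic;

    // Inside-probability (CKY-style) scorer for a toy PCFG in Chomsky normal form.
    // The grammar representation is illustrative; a real grammar would be learned
    // from a treebank.
    class PcfgInsideScorer
    {
        // Binary rules: Parent -> Left Right, with a probability.
        private readonly List<(string Parent, string Left, string Right, double Prob)> binaryRules;
        // Lexical rules: Parent -> word, with a probability.
        private readonly List<(string Parent, string Word, double Prob)> lexicalRules;

        public PcfgInsideScorer(
            List<(string Parent, string Left, string Right, double Prob)> binary,
            List<(string Parent, string Word, double Prob)> lexical)
        {
            binaryRules = binary;
            lexicalRules = lexical;
        }

        // Probability that the start symbol "S" derives the whole word sequence.
        public double SentenceInsideProbability(string[] words)
        {
            int n = words.Length;
            if (n == 0) return 0.0;

            // inside[i, j] maps a nonterminal to the probability of deriving words i..j.
            var inside = new Dictionary<string, double>[n, n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    inside[i, j] = new Dictionary<string, double>();

            // Base case: length-1 spans are covered by lexical rules.
            for (int i = 0; i < n; i++)
                foreach (var (parent, word, prob) in lexicalRules)
                    if (word == words[i])
                        Add(inside[i, i], parent, prob);

            // Longer spans: combine two adjacent sub-spans with a binary rule.
            for (int span = 2; span <= n; span++)
                for (int i = 0; i + span - 1 < n; i++)
                {
                    int j = i + span - 1;
                    for (int k = i; k < j; k++)
                        foreach (var (parent, left, right, prob) in binaryRules)
                            if (inside[i, k].TryGetValue(left, out double pl) &&
                                inside[k + 1, j].TryGetValue(right, out double pr))
                                Add(inside[i, j], parent, prob * pl * pr);
                }

            return inside[0, n - 1].TryGetValue("S", out double p) ? p : 0.0;
        }

        private static void Add(Dictionary<string, double> cell, string key, double value)
        {
            cell[key] = cell.TryGetValue(key, out double existing) ? existing + value : value;
        }
    }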

I've seen preliminary (but unpublished) results on tweets that seem to indicate a bimodal distribution of normalized probabilities: tweets that were judged more grammatical by human annotators often fell within a higher peak, and those judged less grammatical clustered into a lower one. But I don't know how well those results would hold up in a larger or more formal study.
