
Separate meaningful words from a text file using C# or any open source text mining API

I am working on a video processing project in which I extract text from an input video and save it to a text file. The extracted text contains garbage as well as real words. I now need to separate the meaningful words from the generated text and convert them into tags. Can anyone suggest an API or algorithm that can be used for this?

You can take a look at Apache OpenNLP (natural language processing) and its C# port, SharpNLP.

You can use SharpNLP, with SharpEntropy.dll and OpenNLP.dll, to do this along with the following snippet.

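// Created on first use and then reused for later calls.
private OpenNLP.Tools.Tokenize.EnglishMaximumEntropyTokenizer mTokenizer;

// mModelPath is assumed to be a field holding the folder of the SharpNLP
// model files, ending with a path separator.
private string[] Tokenize(string text)
{
    if (mTokenizer == null)
    {
        mTokenizer = new OpenNLP.Tools.Tokenize.EnglishMaximumEntropyTokenizer(mModelPath + "EnglishTok.nbin");
    }
    return mTokenizer.Tokenize(text);
}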

You now have a string array of tokens, that is, an array containing all the data, junk possibly included. Next you have to keep only the meaningful tokens. For this you can use NHunspell.dll:

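// Requires: using System.Collections.Generic; using NHunspell;
public List<string> validate(string[] tokens)
{
      List<string> valid_tokens = new List<string>();
      // en_US.aff and en_US.dic are the Hunspell affix and dictionary files.
      using (Hunspell hunspell = new Hunspell("en_US.aff", "en_US.dic"))
      {
           foreach (string token in tokens)
           {
                // Spell returns true when the dictionary recognizes the token.
                if (hunspell.Spell(token))
                {
                     valid_tokens.Add(token);
                }
           }
      }
      return valid_tokens;
}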

You now have a list, valid_tokens, that contains only the tokens that have a meaning in English. Hope this solves your problem.
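The question also asks about turning the words into tags, which the snippets above do not cover. One possible approach, sketched here under assumptions of my own, is to rank the dictionary-approved tokens by how often they occur and keep the top few. The class name TagBuilder, the method BuildTags, the three-letter minimum, and the small stop-word list are all hypothetical choices for illustration, not part of SharpNLP or NHunspell.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagBuilder
{
    // Hypothetical stop-word list for illustration only; extend it for real use.
    private static readonly HashSet<string> StopWords = new HashSet<string>(
        new[] { "a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on" },
        StringComparer.OrdinalIgnoreCase);

    // Returns up to maxTags distinct lower-cased tokens, most frequent first.
    // Ties are broken alphabetically so the result is deterministic.
    public static List<string> BuildTags(IEnumerable<string> validTokens, int maxTags)
    {
        return validTokens
            .Where(t => !string.IsNullOrEmpty(t))
            .Select(t => t.ToLowerInvariant())
            // Assumption: tokens shorter than 3 letters or containing
            // non-letters are unlikely to make useful tags.
            .Where(t => t.Length >= 3 && t.All(char.IsLetter))
            .Where(t => !StopWords.Contains(t))
            .GroupBy(t => t)
            .OrderByDescending(g => g.Count())
            .ThenBy(g => g.Key, StringComparer.Ordinal)
            .Take(maxTags)
            .Select(g => g.Key)
            .ToList();
    }
}
```

With the methods from the answer, a call might look like TagBuilder.BuildTags(validate(Tokenize(text)), 10). Frequency ranking is only one heuristic; the length and stop-word filters are there because a spell checker tends to accept very short tokens and common function words, which rarely work as tags.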

For a step-by-step guide to integrating SharpNLP into your Visual Studio project, go through this detailed article that I have written: Easy way of Integrating SharpNLP with a Visual Studio C# Project


 