Simple Java Sentence Classification program

I need some help with a sentence classification program.

The program reads a file and checks each sentence in it for any 'keywords'. The keywords are listed in a second file. If a sentence contains a keyword, the program writes that sentence to a third file.

So far I am fine with reading the input file, splitting it into sentences, and writing the output file.

Can you please give some direction on how the program should read each sentence from the first file, compare it against the words in the second file, and write the sentence to a third file if it contains a keyword?

Many thanks!

You can use a Scanner to read the file and extract the words directly.

You can load all the keywords into a TreeSet for comparison; when a sentence contains one of them, write it out with a FileWriter.
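
A minimal sketch of this suggestion, assuming the keywords are separated by whitespace and the sentences have already been split. The class and method names (KeywordFileSketch, loadKeywords, writeMatches) and the file names mentioned in the comments are hypothetical. Note that Scanner tokens keep any attached punctuation, so "cool." would not match the keyword "cool" here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.List;
import java.util.Scanner;
import java.util.Set;
import java.util.TreeSet;

class KeywordFileSketch {

    // Reads whitespace-separated keywords from any text source into a sorted set.
    static Set<String> loadKeywords(Readable source) {
        Set<String> keywords = new TreeSet<>();
        try (Scanner scanner = new Scanner(source)) {
            while (scanner.hasNext()) {
                keywords.add(scanner.next());
            }
        }
        return keywords;
    }

    // Writes every sentence that contains at least one keyword, one per line.
    static void writeMatches(List<String> sentences, Set<String> keywords, Writer out)
            throws IOException {
        for (String sentence : sentences) {
            try (Scanner words = new Scanner(sentence)) {
                while (words.hasNext()) {
                    if (keywords.contains(words.next())) {
                        out.write(sentence);
                        out.write(System.lineSeparator());
                        break; // one hit is enough for this sentence
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // In a real program these could be a FileReader("keywords.txt") and a
        // FileWriter("matches.txt"); both file names are assumptions.
        Set<String> keywords = loadKeywords(new StringReader("cool posting"));
        StringWriter out = new StringWriter();
        writeMatches(List.of("StackOverflow is cool", "Nothing here"), keywords, out);
        System.out.print(out);
    }
}
```

Taking a Readable and a Writer rather than file paths keeps the sketch easy to try with in-memory strings; swapping in FileReader and FileWriter does not change the logic.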

Keywords are a set, I presume. You'll need fast access to them, so use a HashSet.

If your keywords should match only exact counterparts in your sentence, split the sentence on whitespace (the \s+ regex, written "\\s+" in a Java string literal) and try to match each word of the sentence against the elements in the keyword set.
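
As a rough sketch of that exact-match approach (the class name ExactWordMatch and the method containsKeyword are made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

class ExactWordMatch {

    // Returns true if any whitespace-separated word equals a keyword exactly.
    static boolean containsKeyword(String sentence, Set<String> keywords) {
        for (String word : sentence.trim().split("\\s+")) {
            if (keywords.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>();
        keywords.add("posting");
        keywords.add("cool");

        // "posting" appears as a standalone word.
        System.out.println(containsKeyword("I like posting on StackOverflow.", keywords)); // true
        // The last token is "cool." (with the period), so it is not an exact match.
        System.out.println(containsKeyword("StackOverflow is cool.", keywords)); // false
    }
}
```

The second call shows the limit of pure whitespace splitting: punctuation stays attached to the word, so you may want to strip it before the lookup.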

You can build a dependency graph this way: use a HashMap where the keywords are the keys and the values are Sets of the sentences that reference each keyword.

In the end, you could end up with something like this:

[Keyword="StackOverflow"]
    [Values=
        "I like posting on StackOverflow.",
        "StackOverflow is cool."
    ]
[Keyword="posting"]
    [Values=
        "I like posting on StackOverflow."
    ]
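
One way such a map could be built is sketched below. The names KeywordIndex and build are hypothetical, and stripping leading and trailing punctuation from each word is an added assumption so that "StackOverflow." still counts as "StackOverflow".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class KeywordIndex {

    // Maps each keyword to the set of sentences in which it occurs.
    static Map<String, Set<String>> build(List<String> sentences, Set<String> keywords) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.trim().split("\\s+")) {
                // Assumption: drop punctuation at either end of the word.
                String token = word.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
                if (keywords.contains(token)) {
                    index.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(sentence);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "I like posting on StackOverflow.",
                "StackOverflow is cool.");
        Set<String> keywords = new HashSet<>(Arrays.asList("StackOverflow", "posting"));

        Map<String, Set<String>> index = build(sentences, keywords);
        for (Map.Entry<String, Set<String>> entry : index.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

A LinkedHashSet is used for the values only so that sentences print in the order they were read; a plain HashSet works equally well if order does not matter.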

I can give a PHP-based solution:

  1. Parse the sentence string.
  2. Use the strtok() function, with common punctuation (" , ' ( ) / and so on) as the token delimiters.
  3. Form an array/set (the data dictionary) containing the pre-defined words.
  4. Use the preg_match() function. For whole-word matching, you may want to build the array of patterns like this: $variable = array("/(\\bword1\\b)/", "/(\\bword2\\b)/").
  5. For reference documentation on the functions mentioned above, search the PHP docs at http://www.php.net/
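
Since the question is about Java, the whole-word matching in step 4 has a close counterpart in java.util.regex. The following is only a sketch of that idea, not a translation of the PHP steps; the class and method names (WholeWordRegex, wholeWord, containsAny) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WholeWordRegex {

    // Builds a pattern that matches the keyword only as a whole word.
    // Pattern.quote keeps regex metacharacters in the keyword literal.
    static Pattern wholeWord(String keyword) {
        return Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
    }

    // Returns true if any of the patterns is found somewhere in the sentence.
    static boolean containsAny(String sentence, List<Pattern> patterns) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(sentence).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = new ArrayList<>();
        patterns.add(wholeWord("cool"));

        // The period after "cool" is a word boundary, so this matches.
        System.out.println(containsAny("StackOverflow is cool.", patterns)); // true
        // "cooler" contains "cool" but not as a whole word.
        System.out.println(containsAny("cooler heads prevail", patterns)); // false
    }
}
```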

Hope I could help.

Cheers.

