
Text clustering program in Java

I have to develop a project in core Java in which I take roughly 100 lines of text from the user. I want to break the whole text into clusters, where each cluster relates to a keyword. For example, suppose I have text like:

"Java is an object oriented language. It uses classes for modularisation. bla bla bla...

C++ is also an object oriented language. bla bla bla...

Something about OOPS concepts here..."

Now, if I give this whole text as input to the program, I want it to create directories named after the keywords, and to choose those keywords on its own. I expect the keywords in this text to be Java, Modularisation, C++ and OOPS. In later stages I will be dealing with different texts, so the program has to be intelligent enough to decide which words are keywords and which are not, so that it can work with any piece of text.

I have looked in many places, asked many people, and watched many tutorials, only to find that they mostly cluster numerical data; hardly anyone deals with text clustering. I am looking for an algorithm or an approach that can do this.

Thanks

You mostly find tutorials on numerical data because machine learning algorithms need numerical input, so you first have to convert your text into a numerical form. There are a number of ways to do that. One example is the Levenshtein distance, which gives you a numerical distance between two strings. Once you have such a measure, clustering algorithms become applicable. For example, you can use k-means (which needs a vector representation of the text) or any other algorithm to cluster your text data.
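As a rough illustration of the distance mentioned above, here is a minimal sketch of the Levenshtein distance in plain Java. The class and method names are made up for this example; it only shows how two strings turn into one number, not how to build a full clustering pipeline.

```java
final class LevenshteinSketch {

    /**
     * Minimum number of single-character insertions, deletions and
     * substitutions needed to turn a into b.
     */
    static int distance(String a, String b) {
        // prev holds the previous row of the edit-distance table, curr the current one.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two arrays instead of allocating a new row each time.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

Note that this yields a pairwise distance, so it fits distance-based methods (for instance hierarchical clustering) more naturally than k-means, which has to average data points.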

You should also search for material on text mining; there are many good examples on the web. This link could be a good resource.

There are a variety of approaches that you can use to pre-process your text and then cluster the processed data. One example is to generate a bag-of-words representation of the text and then apply clustering methods.

However, I would personally choose LDA topic modeling. This algorithm by itself does not 'cluster' your text, but it can be used as a pre-processing step for text clustering. It is another unsupervised approach, and it gives you a list of 'topics' associated with a set of documents or sentences. Each topic is actually a set of words that are deemed relevant to each other based on how they appear in the underlying text. For instance, the following are three topics extracted from a set of tweets:
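To make the bag-of-words idea concrete, here is a small sketch in plain Java. The class name, the tokenization rule (lowercase, split on anything that is not a letter, digit, '+' or '#', so that "C++" survives) and the use of cosine similarity are all assumptions for this example, not part of any particular library.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class BagOfWordsSketch {

    /** Counts how often each lowercase token occurs in the text. */
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z0-9+#]+");
        for (String token : tokens) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity of two count vectors; 0.0 if either one is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> s1 = bagOfWords("Java is an object oriented language.");
        Map<String, Integer> s2 = bagOfWords("C++ is also an object oriented language.");
        System.out.println(cosine(s1, s2));
    }
}
```

In practice you would probably also drop stop words such as "is" and "an" before comparing, since they make unrelated sentences look similar.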

  • food, wine, beer, lunch, delicious, dining
  • home, real estate, house, tips, mortgage, real estate
  • stats, followers, unfollowers, checked, automatically

Then you can estimate the probability of a sentence belonging to each of these topics by counting the number of times the topic's words appear in the sentence and dividing by the sentence's total word count. Finally, these probability values can be used for text clustering. I should also note that the words generated by LDA are weighted, so you can use the one with the largest weight as your main keyword. For instance, 'food', 'home', and 'stats' have the largest weights in the above lists, respectively.

For an LDA implementation, check out the MALLET library, which is written in Java.
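The counting step described above could look roughly like this in plain Java. This is only a sketch under some assumptions: the topic word lists are typed in by hand rather than produced by LDA, every topic word counts equally (the LDA weights are ignored), multi-word terms such as "real estate" are not matched, and all class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TopicScoreSketch {

    /** Fraction of the sentence's words that belong to the given topic. */
    static double score(String sentence, List<String> topicWords) {
        String[] tokens = sentence.toLowerCase(Locale.ROOT).split("\\W+");
        int total = 0;
        int hits = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            if (topicWords.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    /** Label of the highest-scoring topic, or null if no topic word occurs. */
    static String bestTopic(String sentence, Map<String, List<String>> topics) {
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, List<String>> e : topics.entrySet()) {
            double s = score(sentence, e.getValue());
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Each topic is labelled with its main keyword, as suggested above.
        Map<String, List<String>> topics = new LinkedHashMap<>();
        topics.put("food", Arrays.asList("food", "wine", "beer", "lunch", "delicious", "dining"));
        topics.put("home", Arrays.asList("home", "house", "tips", "mortgage"));
        topics.put("stats", Arrays.asList("stats", "followers", "unfollowers", "checked", "automatically"));

        System.out.println(bestTopic("We had wine and beer at lunch", topics)); // prints food
    }
}
```

The returned label could then serve as the directory name for the sentence's cluster.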
