
Tweet clustering after semantic analysis

I want to cluster a set of tweets. I have already retrieved the tweets, cleaned them up, applied a Naïve Bayes classifier to them, and divided them into two files, positive and negative. Finally, I have done the following to measure the similarity between each pair of tweets:

    with open("positive.txt", "r") as pt:
        lines = pt.readlines()
        # Vectorize each tweet once up front instead of re-vectorizing
        # lineB on every pass of the inner loop.
        vectors = [text_to_vector(line) for line in lines]
        # Cosine similarity for every pair of tweets.
        for lineA, vectorA in zip(lines, vectors):
            for lineB, vectorB in zip(lines, vectors):
                cosine = get_cosine(vectorA, vectorB)
                print(lineA, "\n", lineB, "\n", "Cosine:", cosine)

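Here text_to_vector and get_cosine are not shown in the question; a typical Counter-based implementation, assumed as a stand-in so the snippet runs, looks like this:

    import math
    import re
    from collections import Counter

    WORD = re.compile(r"\w+")

    def text_to_vector(text):
        # Bag-of-words term-frequency vector for one tweet.
        return Counter(WORD.findall(text))

    def get_cosine(vec1, vec2):
        # Cosine similarity between two term-frequency Counters.
        intersection = set(vec1) & set(vec2)
        numerator = sum(vec1[w] * vec2[w] for w in intersection)
        norm1 = math.sqrt(sum(v * v for v in vec1.values()))
        norm2 = math.sqrt(sum(v * v for v in vec2.values()))
        denominator = norm1 * norm2
        return numerator / denominator if denominator else 0.0
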
This is supposed to measure the similarity of each tweet relative to every other one. I was thinking the next step might be to sum, for each tweet, its cosine values against all the other tweets, then plot those sums and apply something like KMeans, as sketched below. I'm not entirely sure I'm taking the correct approach here, so any help is much appreciated.
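Concretely, the idea would be something like the following rough sketch, reusing the vectors list from above and assuming scikit-learn is available:

    from sklearn.cluster import KMeans

    # Sum, for each tweet, its cosine similarity to every other tweet,
    # collapsing each tweet to a single summed-similarity score.
    sums = [sum(get_cosine(vectorA, vectorB) for vectorB in vectors)
            for vectorA in vectors]

    # Cluster the tweets on that one-dimensional feature; two clusters
    # is an arbitrary choice here.
    kmeans = KMeans(n_clusters=2, random_state=0)
    labels = kmeans.fit_predict([[s] for s in sums])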

If you have a set of documents that you want to cluster (based on their content), the easiest option is to use the tool Cluto. You basically have to run it in two steps.

The first step is to execute the program doc2mat, which takes an input file containing all the documents, one document per line. The doc2mat program will write out a matrix file consisting of the tf-idf vector representation of each document.
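An invocation might look like the following sketch (run here through Python's subprocess; the file names are placeholders, and doc2mat's positional doc-file and mat-file arguments should be checked against the CLUTO manual):

    import subprocess

    # Convert the one-document-per-line file into CLUTO's sparse matrix
    # format, tf-idf weighted by default.
    subprocess.run(["doc2mat", "positive.txt", "positive.mat"], check=True)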

You then need to feed this matrix file into the program vcluster, which will produce the clustering results. You can also evaluate the clustering results if you supply a reference class file to vcluster.
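Continuing the sketch, the clustering step could look like this; the cluster count is arbitrary, labels.txt is a hypothetical reference class file, and the -rclassfile option name is taken from my reading of the CLUTO manual, so verify it there:

    import subprocess

    # Cluster the matrix into 10 clusters. vcluster takes the matrix file
    # and the desired number of clusters as positional arguments; the
    # optional -rclassfile flag supplies labels for evaluating the result.
    subprocess.run(
        ["vcluster", "-rclassfile=labels.txt", "positive.mat", "10"],
        check=True,
    )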
