
Representation and a good similarity measure between Tweets for topic detection

I'm planning to write a tool for topic detection on Twitter. I've been thinking about a good similarity measure (distance) between two tweets and how to represent them, taking into account:

  • The #hashtags (I think hashtags are very important when detecting topics on Twitter)
  • The replies (if someone replies to a tweet, the two tweets are probably about the same topic, although two people could start talking about the Samsung Galaxy and end up talking about iPhone jailbreaking, etc.)
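One way to make the two signals above concrete is to store hashtags and reply links separately from the rest of the text and blend them into a single score. The following is only a minimal sketch: the `Tweet` class, the `similarity` function, and the weights `w_words`, `w_tags`, and `w_reply` are hypothetical names and values chosen for illustration, not an established model.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

HASHTAG_RE = re.compile(r"#(\w+)")
WORD_RE = re.compile(r"\w+")


@dataclass
class Tweet:
    """Hypothetical tweet representation: words, hashtags, and a reply link."""
    tweet_id: int
    text: str
    reply_to: Optional[int] = None          # id of the tweet being replied to
    hashtags: frozenset = field(init=False)
    words: frozenset = field(init=False)

    def __post_init__(self):
        lowered = self.text.lower()
        self.hashtags = frozenset(HASHTAG_RE.findall(lowered))
        # Remove hashtags first so they are not counted again as plain words.
        self.words = frozenset(WORD_RE.findall(HASHTAG_RE.sub(" ", lowered)))


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(t1, t2, w_words=0.4, w_tags=0.4, w_reply=0.2):
    """Weighted blend of word overlap, hashtag overlap, and a reply bonus.

    The weights are arbitrary assumptions and would need tuning on real data.
    """
    is_reply = t1.reply_to == t2.tweet_id or t2.reply_to == t1.tweet_id
    return (w_words * jaccard(t1.words, t2.words)
            + w_tags * jaccard(t1.hashtags, t2.hashtags)
            + w_reply * (1.0 if is_reply else 0.0))
```

Keeping the reply weight small reflects the caveat above: a reply suggests a shared topic but does not guarantee one, since conversations drift.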

I'm thinking about implementing what I have so far and running some experiments. I'll implement the classic models (such as TF*IDF with Euclidean distance, cosine similarity, etc.) and the boolean models with a few similarity measures (Hamming, Jaccard, etc.).
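As a baseline for such experiments, the measures named above can be sketched with the standard library alone. This is an illustrative sketch, not a tuned implementation: the function names are hypothetical, tweets are assumed to be already tokenized into lists of words, and the IDF formula `log(N / df) + 1` is just one common variant.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors) for a list of tokenized documents."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) + 1.0 for w in vocab}  # assumed IDF variant
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def to_boolean(v):
    """Boolean model: 1 if the term occurs in the document, else 0."""
    return [1 if x > 0 else 0 for x in v]


def hamming(u, v):
    """Number of positions where two boolean vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def jaccard_bool(u, v):
    """Jaccard similarity of two boolean vectors; 0.0 if both are empty."""
    inter = sum(1 for a, b in zip(u, v) if a and b)
    union = sum(1 for a, b in zip(u, v) if a or b)
    return inter / union if union else 0.0
```

Because tweets are short and the vocabulary is large, the vectors are very sparse; measures that ignore shared absences (cosine, Jaccard) may behave differently from Hamming or Euclidean distance, which is worth checking in the experiments.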

Any ideas on how to adapt an existing model to Twitter, or on how to create a new one?

Similarity Metrics on Twitter discusses some details of the different similarity measures that you can use for clustering Twitter data. We did some research on clustering Twitter users based on user connections, user mentions, geolocation, content similarity between tweets, content similarity between user descriptions, and common #hashtags.

For finding common topics on Twitter, finding connections between the users discussing those topics really helps; we found that groups of users tend to discuss a common topic. There is some detail about this in the second half of this post.


 