
Representation and a good similarity measure between Tweets for topic detection

I'm planning to write a tool for topic detection on Twitter. I've been thinking about a good similarity measure (distance) between two tweets, and how to represent them, taking into account:

  • The #hashtags (I think hashtags are very important when detecting topics on Twitter)
  • The replies (if someone replies to a tweet, both tweets are probably about the same topic, although two people could start out talking about the Samsung Galaxy and end up talking about iPhone jailbreaking, etc.)
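One way to carry both signals in a single representation is sketched below. It is only an illustration, not an established model: the `Tweet` fields, the `parse_tweet` helper, the regular expressions, and the three weights are all assumptions made for the example, and the weights would need tuning on real data.

```python
import re
from dataclasses import dataclass
from typing import Optional

HASHTAG_RE = re.compile(r"#(\w+)")
WORD_RE = re.compile(r"[a-z0-9']+")


@dataclass(frozen=True)
class Tweet:
    """Hypothetical tweet representation: plain terms, hashtags, reply link."""
    tweet_id: int
    terms: frozenset
    hashtags: frozenset
    reply_to: Optional[int] = None  # id of the tweet being replied to, if any


def parse_tweet(tweet_id, text, reply_to=None):
    """Split raw text into a set of hashtags and a set of remaining terms."""
    lowered = text.lower()
    hashtags = frozenset(HASHTAG_RE.findall(lowered))
    # Remove hashtags from the text so they are not counted twice.
    without_tags = HASHTAG_RE.sub(" ", lowered)
    terms = frozenset(WORD_RE.findall(without_tags))
    return Tweet(tweet_id, terms, hashtags, reply_to)


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tweet_similarity(t1, t2, w_terms=0.5, w_tags=0.3, w_reply=0.2):
    """Weighted sum of term overlap, hashtag overlap, and a reply link.

    The weights are assumed values chosen only for illustration.
    """
    linked = t1.reply_to == t2.tweet_id or t2.reply_to == t1.tweet_id
    return (
        w_terms * jaccard(t1.terms, t2.terms)
        + w_tags * jaccard(t1.hashtags, t2.hashtags)
        + w_reply * (1.0 if linked else 0.0)
    )


a = parse_tweet(1, "Loving the new #Samsung Galaxy camera")
b = parse_tweet(2, "The #samsung galaxy battery is great", reply_to=1)
c = parse_tweet(3, "How to jailbreak an #iPhone")
print(tweet_similarity(a, b), tweet_similarity(a, c))
```

Keeping the reply weight small reflects the caveat above: a reply suggests a shared topic but does not guarantee it, so it should nudge the score rather than decide it.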

I'm thinking about implementing what I have so far and running some experiments. I'll implement the classic models (such as TF*IDF with Euclidean distance, cosine similarity, etc.) and the boolean models with a few similarity measures (Hamming, Jaccard, etc.).
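For reference, here is a minimal pure-Python sketch of those baseline measures. It assumes whitespace tokenization and the plain `tf * log(N / df)` weighting; both are simplifications picked for the example, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real tweets need more care)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Euclidean distance between two sparse vectors."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


def jaccard(a, b):
    """Jaccard similarity for the boolean model (documents as term sets)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def hamming(a, b):
    """Hamming distance between boolean term vectors, given as term sets:
    the number of vocabulary terms present in exactly one of the two."""
    return len(a ^ b)


docs = ["samsung galaxy camera", "samsung galaxy battery", "iphone jailbreak guide"]
vecs = tfidf_vectors(docs)
sets = [set(tokenize(d)) for d in docs]
print(cosine(vecs[0], vecs[1]), euclidean(vecs[0], vecs[1]))
print(jaccard(sets[0], sets[1]), hamming(sets[0], sets[1]))
```

Because tweets are very short, most pairs share no terms at all, so these measures will often return exactly zero similarity; that sparsity is part of what the experiments would need to quantify.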

Any ideas on how to adapt an existing model to Twitter, or on how to create a new one?

Similarity Metrics on Twitter discusses the different similarity measures you can use for clustering Twitter data. We did some research on clustering Twitter users based on user connections, user mentions, geo-location, content similarity between tweets, content similarity between user descriptions, and common #hashtags.
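As a rough illustration of how several of these signals could be combined into one user-to-user score, here is a small sketch. It is not the method used in that research: the `UserProfile` fields, the exponential decay for geographic distance, the 500 km scale, and the weights are all assumptions made for the example, and content similarity is left out to keep it short.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class UserProfile:
    """Hypothetical per-user summary of the signals listed above."""
    following: frozenset  # ids of accounts this user is connected to
    mentions: frozenset   # ids of accounts this user has mentioned
    hashtags: frozenset   # hashtags this user has used
    lat: float
    lon: float


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def geo_similarity(u, v, scale_km=500.0):
    """Map distance to (0, 1]; the decay scale is an assumed value."""
    return math.exp(-haversine_km(u.lat, u.lon, v.lat, v.lon) / scale_km)


def user_similarity(u, v, weights=None):
    """Weighted average of the per-signal similarities (weights are assumed)."""
    weights = weights or {"following": 0.3, "mentions": 0.2, "hashtags": 0.3, "geo": 0.2}
    scores = {
        "following": jaccard(u.following, v.following),
        "mentions": jaccard(u.mentions, v.mentions),
        "hashtags": jaccard(u.hashtags, v.hashtags),
        "geo": geo_similarity(u, v),
    }
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total


alice = UserProfile(frozenset({1, 2, 3}), frozenset({2}), frozenset({"android", "samsung"}), 40.71, -74.00)
bob = UserProfile(frozenset({2, 3, 4}), frozenset({2, 5}), frozenset({"samsung"}), 40.73, -73.99)
carol = UserProfile(frozenset({9}), frozenset({8}), frozenset({"cooking"}), 35.68, 139.69)
print(user_similarity(alice, bob), user_similarity(alice, carol))
```

A pairwise score like this can be fed to any clustering algorithm that accepts a precomputed similarity or distance matrix.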

For finding common topics on Twitter, it really helps to find connections between the users discussing those topics; we found that groups of users tend to discuss a common topic. There is some detail about this in the second half of this post.


 