What's a good set of heuristics for threading tweets?
Everyone knows, if you want to thread emails you use Jamie Zawinski's algorithm. But it's a new century, and there's a new messaging service.
What's the best algorithm for threading status updates posted on Twitter?
Things I'd definitely like it to cope with:
The easy part: using in_reply_to_status_id, in_reply_to_user_id and in_reply_to_screen_name. (Incidentally, finding proper documentation of these values would be useful in itself! Such documentation isn't obviously linked to from here, for example.)
Good heuristics for inferring a "reply" relationship from messages that mention a user with the @ convention but aren't explicitly in reply to a particular message. These "mentions" are now provided in the "entities" element of statuses if you request that. These heuristics might take into account (a) the time between two status updates, (b) whether there are subsequent replies between the two users, etc. (Replies that consist of an old-style retweet with an additional comment, as mentioned by user85509 below, are just an instance of this style of reply.)
Conversations that take place between more than two users.
Working with a set of tweets given to the algorithm, or all tweets on Twitter.
... but perhaps you can think of more.
Since there's only been one answer, and the bounty deadline is approaching soon, I thought I should add a baseline answer so the bounty isn't automatically awarded to an answer that doesn't add much beyond what's in the question.
The obvious first step is to take your original set of tweets and follow all in_reply_to_status_id links to build many directed acyclic graphs. These relationships you can be nearly 100% sure about. (You should follow the links even through tweets that aren't in the original set, adding those to the set of status updates that you're considering.)
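A minimal sketch of that first step, assuming each tweet is a plain dict with an "id" and an optional "in_reply_to_status_id". The fetch_tweet callable is hypothetical: it stands in for whatever lookup you use to retrieve a status that wasn't in the original set, and it is not a real Twitter API call.

```python
from collections import defaultdict

def build_reply_forest(tweets, fetch_tweet=None):
    """Group tweets into reply graphs via in_reply_to_status_id.

    tweets: iterable of dicts with "id" and optional "in_reply_to_status_id".
    fetch_tweet: hypothetical callable(status_id) -> dict or None, used to
        pull in parent tweets that are missing from the original set.
    Returns (by_id, children, roots).
    """
    by_id = {t["id"]: t for t in tweets}

    # Follow links through tweets outside the original set, adding them.
    pending = list(by_id.values())
    while pending:
        tweet = pending.pop()
        parent_id = tweet.get("in_reply_to_status_id")
        if parent_id is None or parent_id in by_id:
            continue
        parent = fetch_tweet(parent_id) if fetch_tweet else None
        if parent is not None:
            by_id[parent_id] = parent
            pending.append(parent)  # the parent may itself be a reply

    # Record parent -> children edges; tweets with no known parent are roots.
    children = defaultdict(list)
    roots = []
    for tweet_id, tweet in by_id.items():
        parent_id = tweet.get("in_reply_to_status_id")
        if parent_id in by_id:
            children[parent_id].append(tweet_id)
        else:
            roots.append(tweet_id)
    return by_id, dict(children), sorted(roots)
```

Each root then identifies one conversation, and a tweet whose parent could not be fetched simply becomes the root of its own thread.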
Beyond that easy step, one has to deal with the "mentions". Unlike in email threading, there's nothing helpful like a subject line that one can match on - this is inevitably going to be very error prone. The approach I would take is to create a feature vector for every possible relationship between status IDs that might be represented by mentions in that tweet, and then train a classifier to guess the best option, including a "no reply" option.
To work out the "every possible relationship" bit, start by considering every status update that mentions one or more other users and doesn't contain an in_reply_to_status_id. Suppose an example of one of these tweets is (see footnote 1):
@a @b no it isn't lol RT @c Yes, absolutely. /cc @stephenfry
... you would create a feature vector for the relationship between this update and every update with an earlier date in the timelines of @a, @b, @c, and @stephenfry for the last week (say), and one between that update and a special "no reply" update. Then you have to fill in each feature vector - you can add to this whatever you would like, but I would at least suggest adding:
following / followed ratio for the author of the original update.
The more of these one can come up with the better, since the classifier will only use those that turn out to be useful. I'd suggest trying a random forest classifier, which is conveniently implemented in Weka.
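The candidate generation and feature vectors described above might be sketched as follows. Everything about the data layout is an assumption for illustration: tweets are simplified dicts with "author", "mentions" (a list of screen names) and "created_at" (a datetime), not the real Twitter API payload. The hours_between feature is borrowed from the question's suggestion of using the time between two updates, and "author of the original update" is read here as the author of the update being classified.

```python
from datetime import datetime, timedelta

NO_REPLY = None  # stands in for the special "no reply" update

def needs_inference(tweet):
    """True for updates that mention someone but have no explicit reply link."""
    return bool(tweet.get("mentions")) and tweet.get("in_reply_to_status_id") is None

def candidate_parents(tweet, timelines, window=timedelta(days=7)):
    """Earlier updates by each mentioned user within the window, plus NO_REPLY.

    timelines: hypothetical dict mapping screen name -> list of tweet dicts.
    """
    candidates = [NO_REPLY]
    for name in tweet["mentions"]:
        for other in timelines.get(name, []):
            age = tweet["created_at"] - other["created_at"]
            if timedelta(0) < age <= window:  # strictly earlier, within the window
                candidates.append(other)
    return candidates

def feature_vector(tweet, candidate, users):
    """One feature dict per (tweet, candidate) pair, for a classifier to consume.

    users: hypothetical dict mapping screen name -> {"following", "followers"}.
    """
    author = users[tweet["author"]]
    ratio = author["following"] / max(author["followers"], 1)  # avoid dividing by zero
    if candidate is NO_REPLY:
        return {"no_reply": 1.0, "hours_between": 0.0,
                "following_followed_ratio": ratio}
    delta = tweet["created_at"] - candidate["created_at"]
    return {"no_reply": 0.0, "hours_between": delta.total_seconds() / 3600,
            "following_followed_ratio": ratio}
```

The resulting dicts can be flattened into rows for whichever classifier you train; adding more features only means adding more keys here.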
Next one needs a training set. This can be small at first - just enough to get a service that identifies conversations up and running. To this basic service, one would have to add a nice interface so that users can correct mismatched or falsely linked updates. Using this data one can build a bigger training set and a more accurate classifier.
1 ... which might be typical of the level of discourse on Twitter ;)
On Twitter, people often write "RT" before the message they are replying to.