
What's a good set of heuristics for threading tweets?

Everyone knows, if you want to thread emails you use Jamie Zawinski's algorithm. But it's a new century, and there's a new messaging service.

What's the best algorithm for threading status updates posted on Twitter?

Things I'd definitely like it to cope with:

  • The easy part: using in_reply_to_status_id, in_reply_to_user_id and in_reply_to_screen_name. (Incidentally, finding proper documentation of these values would be useful in itself! Such documentation isn't obviously linked to from here, for example.)

  • Good heuristics for inferring a "reply" relationship from messages that mention a user with the @ convention but aren't explicitly in reply to a particular message. These "mentions" are now provided in the "entities" element of statuses if you request it. These heuristics might take into account (a) the time between two status updates, (b) whether there are subsequent replies between the two users, etc. (Replies that consist of an old-style retweet with an additional comment, as mentioned by user85509 below, are just an instance of this style of reply.)

  • Conversations that take place between more than two users.

  • Working with a set of tweets given to the algorithm, or all tweets on Twitter.

... but perhaps you can think of more.

Since there's only been one answer, and the bounty deadline is approaching soon, I thought I should add a baseline answer so the bounty isn't automatically awarded to an answer that doesn't add much beyond what's in the question.

The obvious first step is to take your original set of tweets and follow all in_reply_to_status_id links to build many directed acyclic graphs. You can be nearly 100% sure about these relationships. (You should follow the links even through tweets that aren't in the original set, adding those to the set of status updates that you're considering.)
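As a rough illustration of that first step, here is a minimal Python sketch. It assumes each tweet is a dict with an "id" and an optional "in_reply_to_status_id"; the `fetch_status` callback is hypothetical and stands in for whatever you use to look up a status that isn't in the original set.

```python
from collections import defaultdict


def build_reply_threads(tweets, fetch_status=None):
    """Group tweets into threads by following in_reply_to_status_id links.

    tweets: iterable of dicts with "id" and optional "in_reply_to_status_id".
    fetch_status: hypothetical callback returning a tweet dict for an id that
    is outside the original set, or None if it can't be retrieved.
    Returns (roots, children): thread-starting ids and a parent -> replies map.
    """
    by_id = {t["id"]: t for t in tweets}

    # Follow links through tweets that aren't in the original set,
    # adding each fetched parent to the set under consideration.
    pending = list(by_id.values())
    while pending:
        tweet = pending.pop()
        parent_id = tweet.get("in_reply_to_status_id")
        if parent_id is None or parent_id in by_id or fetch_status is None:
            continue
        parent = fetch_status(parent_id)
        if parent is not None:
            by_id[parent_id] = parent
            pending.append(parent)

    children = defaultdict(list)
    roots = []
    for tweet_id, tweet in by_id.items():
        parent_id = tweet.get("in_reply_to_status_id")
        if parent_id in by_id:
            children[parent_id].append(tweet_id)
        else:
            # No known parent: this update starts a thread.
            roots.append(tweet_id)
    return roots, dict(children)
```

Each root then identifies one thread, and walking `children` from it recovers the explicit replies.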

Beyond that easy step, one has to deal with the "mentions". Unlike in email threading, there's nothing helpful like a subject line that one can match on - this is inevitably going to be very error prone. The approach I would take is to create a feature vector for every possible relationship between status IDs that might be represented by mentions in that tweet, and then train a classifier to guess the best option, including a "no reply" option.

To work out the "every possible relationship" bit, start by considering every status update that mentions one or more other users and doesn't contain an in_reply_to_status_id. Suppose an example of one of these tweets is: 1

@a @b no it isn't lol  RT @c Yes, absolutely. /cc @stephenfry

... you would create a feature vector for the relationship between this update and every update with an earlier date in the timelines of @a, @b, @c, and @stephenfry for the last week (say), and one between that update and a special "no reply" update. Then you have to fill in each feature vector - you can add to this whatever you would like, but I would at least suggest adding:

  • The time that elapsed between the two updates - presumably replies are more likely to be to recent updates.
  • How far through the tweet, as a proportion of its words, the mention occurs. e.g. if the mention is the first word, this would be a score of 0, and that's probably more likely to indicate a reply than mentions later in the update.
  • The number of followers of the mentioned user - celebrities are presumably more likely to be spam-mentioned.
  • The length of the longest common substring between the updates, which might indicate direct quoting.
  • Is the mention preceded by "/cc" or other signifiers that indicate that this isn't directly a reply to that person?
  • The following / followed ratio for the author of the original update.
  • etc.
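To make the list above concrete, here is a small Python sketch of the first five features. The dict keys ("text", "created_at", "screen_name", "followers") are assumptions made for illustration, not Twitter's actual field names, and the tokenisation is deliberately naive.

```python
from datetime import datetime
from difflib import SequenceMatcher

PUNCTUATION = ".,:;!?"


def mention_index(words, screen_name):
    """Index of the first word that is a mention of screen_name, or None."""
    target = "@" + screen_name.lower()
    for i, word in enumerate(words):
        if word.lower().rstrip(PUNCTUATION) == target:
            return i
    return None


def mention_position(text, screen_name):
    """Proportion of the way through the tweet (0 = first word, 1 = last)."""
    words = text.split()
    i = mention_index(words, screen_name)
    if i is None:
        return None
    return i / max(len(words) - 1, 1)


def preceded_by_cc(text, screen_name):
    """True if the word right before the mention is a "/cc"-style marker."""
    words = text.split()
    i = mention_index(words, screen_name)
    return bool(i) and words[i - 1].lower() in ("/cc", "cc", "cc:")


def longest_common_substring(a, b):
    """Length of the longest run of characters shared by both updates."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def pair_features(update, candidate, mentioned_user):
    """Feature dict for "update is a reply to candidate" (assumed keys)."""
    name = mentioned_user["screen_name"]
    elapsed = update["created_at"] - candidate["created_at"]
    return {
        "seconds_elapsed": elapsed.total_seconds(),
        "mention_position": mention_position(update["text"], name),
        "mentioned_followers": mentioned_user["followers"],
        "longest_common_substring": longest_common_substring(
            update["text"], candidate["text"]),
        "preceded_by_cc": preceded_by_cc(update["text"], name),
    }
```

For the example tweet, the mention of @a gets position 0, while @stephenfry gets position 1 and is flagged as preceded by "/cc".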

The more of these one can come up with the better, since the classifier will only use those that turn out to be useful. I'd suggest trying a random forest classifier, which is conveniently implemented in Weka.
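Whatever classifier ends up being used, the final step is picking the best candidate or falling back to "no reply". The Python sketch below shows only that selection step. `toy_score` is a hand-weighted stand-in, not a trained random forest, and treating "no reply" as a fixed threshold is a simplifying assumption; a real classifier would score that option like any other.

```python
NO_REPLY = None  # sentinel for the special "no reply" option


def choose_parent(candidates, score, no_reply_score=0.5):
    """Return the status id of the most likely parent update, or NO_REPLY.

    candidates: list of (status_id, feature_dict) pairs.
    score: callable mapping a feature dict to a probability-like number;
    in practice this would wrap the trained classifier.
    no_reply_score: assumed baseline a candidate must beat to be linked.
    """
    best_id, best_score = NO_REPLY, no_reply_score
    for status_id, feats in candidates:
        s = score(feats)
        if s > best_score:
            best_id, best_score = status_id, s
    return best_id


def toy_score(feats):
    """Hand-picked weights for illustration only, NOT a trained model."""
    # Recent updates score close to 1, updates from days ago close to 0.
    recency = 1.0 / (1.0 + feats["seconds_elapsed"] / 3600.0)
    # Mentions at the start of the tweet score 1, at the end 0.
    position = 1.0 - feats["mention_position"]
    return 0.5 * recency + 0.5 * position
```

Swapping `toy_score` for a function that calls the trained model leaves `choose_parent` unchanged.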

Next one needs a training set. This can be small at first - just enough to get a service that identifies conversations up and running. To this basic service, one would have to add a nice interface that lets users correct mismatched or falsely linked updates. Using this data one can build a bigger training set and a more accurate classifier.

1 ... which might be typical of the level of discourse on Twitter ;)

On Twitter, people often write "RT" before the message they are replying to.
