Twitter Streaming API Multiple Stream vs Custom Filter

Original post: 2013-01-03 09:28:52 | Tags: javascript, node.js, twitter

I'm building a node.js application that opens a connection to the Twitter Streaming API (v1.1).

I would like to filter multiple keywords (hashtags and words) as separate queries. My original idea was to have multiple public streams.

However, I understand that I can only have one open connection to the Twitter Streaming API per application and per IP address, and that Twitter encourages us to come up with creative solutions to get what we want.

So my question is this:

If I stream with no filters, such as using statuses/sample (which I believe is 1%), and use custom JavaScript to filter the output, would I get the same tweets as if I used the API's own filtering (i.e. track='twitter')?
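For reference, the "custom JavaScript filter" side of this comparison could look roughly like the sketch below. It is only an approximation of how track matching behaves: it assumes a phrase matches when every space-separated term appears as a whole token in the tweet text, ignoring case and order, and that a bare term also matches its #hashtag and @mention forms. The names tokenize, matchesTrack and routeTweet are made up for illustration, and the real API matches against more fields than the text.

```javascript
// Split text into lowercase tokens, keeping '#' and '@' prefixes attached.
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}_#@]+/u).filter(Boolean);
}

// Token set for a tweet: "#news" also counts as "news" (assumed behaviour).
function tokenSet(text) {
  const set = new Set();
  for (const tok of tokenize(text)) {
    set.add(tok);
    if (tok.length > 1 && (tok[0] === "#" || tok[0] === "@")) {
      set.add(tok.slice(1));
    }
  }
  return set;
}

// True if at least one phrase has all of its terms present in the tweet.
function matchesTrack(tweetText, phrases) {
  const tokens = tokenSet(tweetText);
  return phrases.some((phrase) =>
    tokenize(phrase).every((term) => tokens.has(term))
  );
}

// Return the names of all queries that a single tweet satisfies, so one
// unfiltered stream can feed several separate keyword queries.
function routeTweet(tweet, queries) {
  return Object.keys(queries).filter((name) =>
    matchesTrack(tweet.text, queries[name])
  );
}

// Usage: one tweet routed against three hypothetical queries.
console.log(
  routeTweet(
    { text: "Obama speaks on #news tonight" },
    { obama: ["obama"], news: ["news"], sports: ["sports"] }
  )
);
```

Even with identical matching rules, this only ever sees whatever the sample stream delivers, which is the crux of the question.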

Edit: I have created a diagram explaining this:

[Image: diagram of the two filtering setups]

As you can see, I want to know if the two outputs will be the same. I suspect that they won't be, because although both outputs effectively apply the same filter, one source is a 1% sample, and maybe the other source is a 100% sample but only delivering 1% of the tweets from it.

So can someone please clarify whether both outputs are the same?

Thank you.

3 Answers

According to the Twitter Streaming API rules, if the keywords that you track don't exceed 1% of the whole global traffic, you will receive all matching data (some tweets might be lost due to network issues etc., but not a significant number). This is called the garden hose (the firehose is a special filter which gives you all the data, but it is offered as a paid service through third parties such as http://datasift.com/).

So if a tweet matching your keywords comes through the public stream, it will also be part of your custom filter, unless your keyword set is too broad.

By using custom filters you can track multiple search keywords, and if you miss some data because your keyword set is too broad, Twitter sends a track limitation notice indicating how much data you are missing.
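As a rough sketch of how a client might watch for those notices: the code below assumes, as I recall from the v1.1 streaming documentation, that a limit notice arrives as a JSON message of the form {"limit": {"track": N}}, where N is the number of undelivered tweets since the connection was opened. The createLimitTracker helper and its return shape are hypothetical.

```javascript
// Tracks delivered tweets and limit notices for one stream connection.
function createLimitTracker() {
  let delivered = 0;
  let undelivered = 0;
  return {
    // Classify one already-parsed stream message.
    handle(message) {
      if (message && message.limit && typeof message.limit.track === "number") {
        // Assumed to be a running total per connection; notices may arrive
        // out of order, so keep the largest value seen.
        undelivered = Math.max(undelivered, message.limit.track);
        return "limit";
      }
      if (message && typeof message.text === "string") {
        delivered += 1;
        return "tweet";
      }
      return "other"; // deletes, warnings, keep-alives, etc.
    },
    // Summary of how much of the matched traffic was missed.
    stats() {
      const total = delivered + undelivered;
      return {
        delivered,
        undelivered,
        missedFraction: total === 0 ? 0 : undelivered / total,
      };
    },
  };
}

// Usage with hand-made messages standing in for a real stream.
const tracker = createLimitTracker();
tracker.handle({ text: "first tweet" });
tracker.handle({ limit: { track: 5 } });
console.log(tracker.stats());
```

A missedFraction that stays at zero suggests the keyword set is still below the delivery cap.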

My suggestion would be to use a custom filter and compare what you get from the stream with what you get from Twitter as a result for the same keywords. When you start getting track limitation notices from Twitter, it is time to split your keyword set into chunks and stream each chunk through a different streamer, running them on different machines.
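Splitting the keyword set could be done along these lines. This is a minimal sketch: chunkKeywords is a made-up helper, the chunk size is whatever each streamer can carry without triggering limit notices, and how the chunks are handed to the separate machines is left out.

```javascript
// Split a keyword list into chunks of at most `chunkSize` entries,
// after trimming, lowercasing and removing duplicates.
function chunkKeywords(keywords, chunkSize) {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const unique = [
    ...new Set(keywords.map((k) => k.trim().toLowerCase()).filter(Boolean)),
  ];
  const chunks = [];
  for (let i = 0; i < unique.length; i += chunkSize) {
    chunks.push(unique.slice(i, i + chunkSize));
  }
  return chunks;
}

// Usage: each inner array would go to its own streamer.
console.log(chunkKeywords(["obama", "#news", "Obama", "twitter"], 2));
```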

The details of filter streaming are below (taken from the official website, https://dev.twitter.com/docs/api/1.1/post/statuses/filter):

Returns public statuses that match one or more filter predicates. Multiple parameters may be specified which allows most clients to use a single connection to the Streaming API. Both GET and POST requests are supported, but GET requests with too many parameters may cause the request to be rejected for excessive URL length. Use a POST request to avoid long URLs.

The default access level allows up to 400 track keywords, 5,000 follow userids and 25 0.1-360 degree location boxes. If you need elevated access to the Streaming API, you should explore our partner providers of Twitter data here.
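To make the quoted limits concrete, here is a hedged sketch of building the form body for a POST to statuses/filter. It assumes that track and follow are sent as comma-separated lists, which is how I remember the v1.1 documentation describing them. buildFilterBody is a hypothetical helper, and OAuth signing and the HTTP request itself are deliberately left out.

```javascript
// Default access level limits quoted in the documentation above.
const LIMITS = { track: 400, follow: 5000 };

// Build an application/x-www-form-urlencoded body for statuses/filter.
function buildFilterBody({ track = [], follow = [] } = {}) {
  if (track.length > LIMITS.track) {
    throw new RangeError(`at most ${LIMITS.track} track keywords allowed`);
  }
  if (follow.length > LIMITS.follow) {
    throw new RangeError(`at most ${LIMITS.follow} follow user IDs allowed`);
  }
  if (track.length === 0 && follow.length === 0) {
    throw new Error("at least one filter predicate is required");
  }
  const params = new URLSearchParams();
  if (track.length > 0) params.set("track", track.join(","));
  if (follow.length > 0) params.set("follow", follow.join(","));
  return params.toString();
}

// Usage: several keywords share a single connection.
console.log(buildFilterBody({ track: ["twitter", "#obama", "barackobama"] }));
```

Sending the predicates in a POST body rather than a query string follows the documentation's advice about avoiding long URLs.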

I would like to answer my question with the results of my findings.

I tested both side by side in the same time frame and concluded that the custom filter method, whilst it supports multiple filters, does not provide enough tweets to create an interesting enough visualisation.

I think the only way to get something more interesting with concurrent filters is to look at other methods, but I am wondering if it's not possible. Maybe with a third party.

I have attached a screenshot of the visualisation tracking 'barackobama'. The left is the custom filter, the right is statuses/filter.

[Image: screenshot of the two visualisations]

The statuses/filter API operates on all tweets, not just those returned by statuses/sample. You can tell by looking at the tweet IDs: sample tweets all come from a specific time window. So from the millisecond-resolution creation time, you can definitely tell that filter returns tweets outside of sample.

For more details about getting the creation time from a tweet ID and the time window on sample tweets, consult this post: http://blog.falcondai.com/2013/06/666-and-how-twitter-samples-tweets-in.html
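A small sketch of that ID-to-time conversion, assuming the usual Snowflake layout (creation time in milliseconds stored above the low 22 bits, offset by the epoch 1288834974657). The function names are made up, and IDs issued before Snowflake do not follow this layout. The ID is taken as a string (id_str) because tweet IDs exceed the safe integer range of JavaScript numbers.

```javascript
// Snowflake epoch in Unix milliseconds (assumed, per the layout above).
const TWITTER_EPOCH_MS = 1288834974657n;

// Creation time of a tweet, in Unix milliseconds, from its id_str.
function tweetIdToMillis(idStr) {
  return Number((BigInt(idStr) >> 22n) + TWITTER_EPOCH_MS);
}

// Millisecond offset within the second in which the tweet was created.
function millisWithinSecond(idStr) {
  return tweetIdToMillis(idStr) % 1000;
}

// Usage: a synthetic ID whose timestamp part is 1 ms after the epoch.
console.log(tweetIdToMillis("4194304"), millisWithinSecond("4194304"));
```

Collecting millisWithinSecond for tweets from both endpoints is one way to check whether the filtered tweets fall outside the sample window.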