

NoSQL recommendation for a project (text streaming)

I'm looking for a NoSQL DB recommendation. Here's what I'm working on:

I'm writing a web-based client for delivering text streams (basically, real-time captions) to a significant number of consumers. Once things are fully ramped up, there might be 100+ events happening at any given moment. Many will be small (< 10 consumers), but some of them could be quite large (10,000+ simultaneous consumers, maybe more).

During the course of each event, text will accumulate at a rate of anywhere from a few words per minute up to 200+ words per minute. Each consumer will be running a web client (a browser on a desktop/laptop/tablet/smartphone) which will poll periodically for any text that it hasn't already received. It will also be possible for a given user to ask for the full text of the event up to the time that they make the request. Completed events have to stick around for a while, but will be removed within about 24-36 hours of their completion.

My first thought is to use Redis, which has methods for appending to a text value in the datastore as well as built-in support for getting a substring from the end of a text value (i.e. a client could just hold the character offset of the last character it received and pass that to the client API, which would use it to pull a substring from the event text). I am concerned, though, that the growth of the string containing the event text might be an unusual use of Redis and could cause me some issues.
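The append-and-offset idea described above can be sketched without a Redis server. This is only an in-memory stand-in: the `EventText` class and its method names are hypothetical, and the text is assumed to be ASCII so that byte and character offsets coincide. In Redis itself the corresponding commands would be `APPEND` and `GETRANGE`.

```python
class EventText:
    """In-memory stand-in for one Redis string key holding an event's text."""

    def __init__(self):
        self._buf = bytearray()

    def append(self, text):
        # Like APPEND: add to the end and return the new length in bytes.
        # Assumption: ASCII-only text, so bytes and characters line up.
        self._buf += text.encode("ascii")
        return len(self._buf)

    def read_from(self, offset):
        # Like GETRANGE key <offset> -1: everything from offset to the end.
        chunk = bytes(self._buf[offset:])
        return chunk.decode("ascii"), offset + len(chunk)


event = EventText()
event.append("Good evening. ")
text, offset = event.read_from(0)            # first poll: full text so far
event.append("Welcome to the show.")
new_text, offset = event.read_from(offset)   # later poll: only the new part
```

A client that wants the full text so far simply polls with offset 0; a client that is already caught up gets an empty string back and keeps its offset.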

So... is there a NoSQL DB that seems particularly well suited to this sort of application? Is there any significant reason NOT to use Redis for something like this?

An underlying open question is what to do about new clients. For example, say an event has started and someone connects a few minutes into it. Do they need everything from the beginning, or just from when they connected?

If the latter, I'd recommend a message system instead of appending strings to strings. One way would be to use Redis' Pub/Sub instead. That seems a better fit overall, especially if new connections do not need everything from the beginning. For longer-term storage, have a client that listens like any other and archives the entries, preferably caching them locally and then uploading the transcript when the event completes or while it is in progress. I'd keep the real-time need and code separate from requesting history and archives.
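To illustrate the trade-off, here is a toy in-memory publish/subscribe hub. It is not Redis client code: the `Broker` class and the channel name `event:42` are made up for the example, and the method names only mirror the Redis `SUBSCRIBE` and `PUBLISH` commands. The point it shows is that a late subscriber receives nothing published before it joined, while an archiving client subscribed from the start collects everything.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory publish/subscribe hub (stand-in for Redis Pub/Sub)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, inbox):
        # Like SUBSCRIBE: only messages published from now on are delivered.
        self._subscribers[channel].append(inbox)

    def publish(self, channel, message):
        # Like PUBLISH: deliver to current subscribers and
        # return how many of them received the message.
        for inbox in self._subscribers[channel]:
            inbox.append(message)
        return len(self._subscribers[channel])


broker = Broker()
early_viewer, late_viewer, archiver = [], [], []

broker.subscribe("event:42", early_viewer)
broker.subscribe("event:42", archiver)      # the archive client listens like any other
first_count = broker.publish("event:42", "Good evening.")

broker.subscribe("event:42", late_viewer)   # joins a few minutes in
second_count = broker.publish("event:42", "Welcome to the show.")
```

Here `late_viewer` ends up with only the second message, so a full-transcript request would have to be served from the archive rather than from the live channel.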

Another route would be to use an ordered set (a Redis sorted set), scoring each entry with the timestamp at which it was made. The client then only keeps track of the time of its last update and retrieves anything from that time on. Ordered Sets documentation can be found here. This method also provides the ability to select a region of time from the transcript. With a bit of math you could even replay the event from a transcript viewpoint as if it were live. If you've got tens of thousands of clients pulling the entire transcript each poll
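The timestamp-scored set might look roughly like the sketch below. It uses Python's `bisect` module on parallel lists as an in-memory stand-in for a sorted set; the `Transcript` class and its method names are hypothetical. The comments name the Redis commands (`ZADD`, `ZRANGEBYSCORE`) that would play the same role.

```python
import bisect


class Transcript:
    """In-memory stand-in for a Redis sorted set scored by timestamp."""

    def __init__(self):
        self._scores = []    # timestamps, kept in ascending order
        self._members = []   # text entries, parallel to _scores

    def add(self, timestamp, text):
        # Like ZADD key <timestamp> <text>.
        i = bisect.bisect_right(self._scores, timestamp)
        self._scores.insert(i, timestamp)
        self._members.insert(i, text)

    def since(self, last_seen):
        # Like ZRANGEBYSCORE key (<last_seen> +inf WITHSCORES:
        # entries strictly newer than the client's last update.
        i = bisect.bisect_right(self._scores, last_seen)
        return list(zip(self._scores[i:], self._members[i:]))

    def between(self, start, end):
        # Like ZRANGEBYSCORE key <start> <end>: an inclusive window of time.
        lo = bisect.bisect_left(self._scores, start)
        hi = bisect.bisect_right(self._scores, end)
        return self._members[lo:hi]


transcript = Transcript()
transcript.add(100.0, "Good evening.")
transcript.add(101.5, "Welcome to the show.")
transcript.add(103.0, "Our first guest is here.")

updates = transcript.since(100.0)     # client last saw the entry at t=100.0
last_seen = updates[-1][0]            # remember the newest timestamp received
window = transcript.between(100.0, 101.5)
```

One difference from this stand-in: members of a real sorted set are unique, so two identical lines of text would collapse into one entry unless each member carries something distinguishing, such as a sequence number prefix.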

Another advantage of the timestamp-ordered set is string encoding. When using Redis strings and getrange, you have to use fixed-width encodings, because the range is given in byte offsets, not character offsets. If you need to support a variable-width encoding such as UTF-8, this might be a problem for you: a character offset held by the client will not match the byte offset once the text contains multi-byte characters.
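The byte-versus-character mismatch is easy to reproduce in plain Python, with no Redis involved. The sample text is made up; slicing the encoded bytes stands in for what a byte-offset range read would return.

```python
text = "café ☕ time"
data = text.encode("utf-8")

# 11 characters, but 14 bytes: "é" takes 2 bytes and "☕" takes 3.
char_count, byte_count = len(text), len(data)

# The client has received the 4 characters "café" and naively
# reuses that character count as a byte offset.
char_offset = 4
try:
    data[char_offset:].decode("utf-8")
    naive_slice_decodes = True
except UnicodeDecodeError:
    # The slice starts in the middle of the two-byte "é".
    naive_slice_decodes = False

# Converting the character offset to a byte offset gives the right slice.
byte_offset = len(text[:char_offset].encode("utf-8"))
rest = data[byte_offset:].decode("utf-8")
```

Tracking byte offsets on the server side (or handing the client the byte offset to send back) avoids the problem, at the cost of the client no longer being able to derive the offset from the text it holds.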

A third option is to append each string of text to a list. This is similar to the sorted set, except that your client stores the index of the last entry it received and on each poll tries to get anything from lastIndex+1 to the end.
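A minimal sketch of the list variant, using a Python list in place of a Redis list. The `poll` helper is hypothetical; in Redis the append would be `RPUSH` and the read would be `LRANGE key <lastIndex+1> -1`. Starting a new client at index -1 is an assumption of this sketch so that the first poll begins at index 0.

```python
def poll(entries, last_index):
    """Return entries after last_index and the updated last index."""
    # Like LRANGE key <last_index + 1> -1: from the next entry to the end.
    new_entries = entries[last_index + 1:]
    return new_entries, last_index + len(new_entries)


entries = []                        # stand-in for the Redis list
entries.append("Good evening.")     # like RPUSH key "Good evening."
entries.append("Welcome to the show.")

last_index = -1                     # a new client has received nothing yet
first_batch, last_index = poll(entries, last_index)

entries.append("Our first guest is here.")
second_batch, last_index = poll(entries, last_index)
third_batch, last_index = poll(entries, last_index)   # nothing new
```

Because list indexes count entries rather than bytes, this variant is not affected by the text encoding.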


 