
Limits to move from SQL to NoSQL database

Original post: 2014-01-18 17:04:12. Tags: mysql, database, nosql, hbase, bigdata

We are facing performance-related issues in our current MySQL DB. Our application is pretty heavy on a few tables (~20). We run a lot of aggregation queries on these tables, as well as writes. Most of our team are developers, and we don't have access to a DBA who might help retune our current DB and make things work faster.

Moving to NoSQL is an option, but we are seriously wondering what the upper limits are in terms of:

Volumes (current volume per day: ~50 GB)
Structured or raw data? (Structured data)
IO stats on DB (current rate is 60 KB/sec)
Record writes (now 3000 rows/sec)
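
As a rough sanity check, the daily volume and write rate above can be put into common units. This is only a back-of-the-envelope sketch: it assumes 1 GB = 10^9 bytes and that the 3000 rows/sec rate holds around the clock, neither of which the post states.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_volume_bytes = 50 * 10**9   # ~50 GB/day, taking 1 GB = 10^9 bytes (assumption)
rows_per_second = 3000            # assumed to be sustained all day (assumption)

# Average ingest rate implied by the daily volume.
avg_bytes_per_second = daily_volume_bytes / SECONDS_PER_DAY

# Rows written per day if the write rate never drops.
rows_per_day = rows_per_second * SECONDS_PER_DAY

# Average row size implied by the two figures together.
avg_bytes_per_row = daily_volume_bytes / rows_per_day

print(f"{avg_bytes_per_second / 1000:.0f} KB/s average ingest")
print(f"{rows_per_day / 1e6:.1f} million rows/day")
print(f"{avg_bytes_per_row:.0f} bytes/row on average")
```

Under those assumptions the workload comes to a few hundred million fairly small rows per day.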

Questions that arise:

Is 50 GB high enough to consider NoSQL? Some documentation recommends more than a TB.
The data should be raw data, which can be further processed to become structured and be used in the application.
MySQL maxes out at 3000 rows/sec for us; we are not sure whether it can be tuned further.

HBase seems promising for analytic applications.

We would like some guidelines on the RDBMS limits at which one should think of moving to NoSQL.

2 answers

This is such a broad topic that I don't believe there are any "right" answers, but maybe a few general recommendations will help:

I think you should approach this challenge as a matter of picking the right tool for the problem. All databases have their pros and cons, and for some challenges the best approach is to use an entire toolbox to get the job done.

Note that moving your data, or even just parts of it, to different datastores is rarely a trivial effort. Use this chance to rethink your data model before implementing it.

Getting this job done should also take into account more requirements, such as your growth plans. It looks like you're at this crossroads because your original assumptions->choices are no longer on par with reality. If you want to delay the next time you're at the same place, you should use this opportunity to do so.

Lastly, keep in mind that the job is really done only after you do something with all that captured data, or else I'd recommend you use the infinitely-scalable write-to-/dev/null design pattern ;) Put differently, unless your data is write-only, you'll want to make sure that whatever SQL/NoSQL/NewSQL/other datastore you choose can also get you the data/information/knowledge within your use case's acceptable time frames.

It will probably be worth it given your current infrastructure, but keep in mind that it's going to be a huge task, since you're going to need to redesign the whole process. HBase can help you, as it has some neat features, like realtime counters (which in some cases eliminate the need for periodic rollups) or per-client buffering (which can allow you to scale to >100k writes per second). Be warned, though, that it cannot be queried in the same way you query a relational database, so you're going to need to plan it carefully to make it work for you.
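
To make the two features mentioned above more concrete, here is a minimal plain-Python sketch of the ideas behind them. It is not the HBase client API: `BufferedWriter`, `fake_server`, and the key names are hypothetical stand-ins that only illustrate batching writes on the client and incrementing counters as data arrives.

```python
from collections import Counter


class BufferedWriter:
    """Hypothetical client-side write buffer: collects rows and ships them in batches."""

    def __init__(self, send_batch, max_rows=1000):
        self.send_batch = send_batch  # callable that receives a list of rows
        self.max_rows = max_rows
        self.rows = []

    def put(self, row):
        # Buffer locally; only contact the server once the buffer is full.
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        if self.rows:
            self.send_batch(self.rows)
            self.rows = []


# Stand-in for the server side: counters are incremented as rows arrive,
# so the totals can be read at any time without a separate rollup job.
counters = Counter()
batches_sent = []


def fake_server(rows):
    batches_sent.append(len(rows))
    for key, amount in rows:
        counters[key] += amount


writer = BufferedWriter(fake_server, max_rows=3)
for i in range(7):
    writer.put(("hits:home" if i % 2 == 0 else "hits:about", 1))
writer.flush()  # ship the final partial batch

print(batches_sent, dict(counters))
```

The trade-off is the usual one for buffering: fewer round trips per row, at the cost of losing whatever is still in the buffer if the client dies before a flush.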

It seems that your main issue is with the raw data writes. Sure, you can definitely rely on HBase for that, and then do the rollups every X minutes to store the data in your RDBMS so it can be queried as usual. But given you're doing them every minute, which is a very short gap, why don't you keep the data in memory and flush it to the rolled-up tables every minute? Sure, you could lose data, but I don't know how critical losing one minute of data is for you, and that alone could help you a lot.
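
A minimal sketch of that in-memory approach might look like the following. It uses Python's built-in `sqlite3` as a stand-in for the RDBMS (the question is about MySQL), and the `MinuteRollup` class, the `rollup` table, and its columns are all hypothetical names chosen for illustration.

```python
import sqlite3
from collections import defaultdict


class MinuteRollup:
    """Accumulate per-key counts and sums in memory, then flush them in one batch."""

    def __init__(self, conn):
        self.conn = conn
        self.pending = defaultdict(lambda: [0, 0.0])  # key -> [count, total]
        conn.execute(
            "CREATE TABLE IF NOT EXISTS rollup "
            "(bucket TEXT, key TEXT, cnt INTEGER, total REAL)"
        )

    def record(self, key, value):
        # Pure in-memory update: no database round trip per incoming row.
        slot = self.pending[key]
        slot[0] += 1
        slot[1] += value

    def flush(self, bucket):
        # One transaction per flush, one row per key for this bucket.
        # Anything still in memory is lost if the process dies before this runs.
        rows = [(bucket, key, cnt, total) for key, (cnt, total) in self.pending.items()]
        with self.conn:
            self.conn.executemany(
                "INSERT INTO rollup (bucket, key, cnt, total) VALUES (?, ?, ?, ?)",
                rows,
            )
        self.pending.clear()
        return len(rows)


conn = sqlite3.connect(":memory:")
rollup = MinuteRollup(conn)
for key, value in [("page_a", 2.0), ("page_b", 5.0), ("page_a", 3.0)]:
    rollup.record(key, value)

flushed = rollup.flush("2014-01-18 17:04")
result = conn.execute("SELECT key, cnt, total FROM rollup ORDER BY key").fetchall()
print(flushed, result)
```

In a real service `flush` would be called from a timer or scheduler once per minute, so the database sees one small batch per minute instead of thousands of row writes per second.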

Anyway, the best advice I can think of: read a book, understand how HBase works first, dig into the pros and cons, and think about how it can suit your specific needs. This is crucial, because a good implementation is what will determine whether it's a success or a total failure.

Some resources:

HBase: The Definitive Guide

HBase Administration Cookbook

HBase Reference Guide (free)