
Which database engine for a large dataset?

2017-10-08 16:52:55 | Tags: mysql, elasticsearch, relational-database, wide-column-store

I'm working on an analysis assignment. We got a partial data set from the university library containing almost 300,000,000 rows.

Each row contains:

ID
Date
Owner
Deadline
Checkout_date
Checkin_date

I put all of this into a MySQL table and started querying it for my analysis assignment. However, simple queries ( SELECT * FROM table WHERE ID = something ) were taking 9-10 minutes to complete. So I created an index for all the columns, which made them noticeably faster (about 30 seconds).

So I started reading about similar issues, and people recommended switching to a "Wide column store" or a "Search engine" instead of a "Relational" database.

So my question is, what would be the best database engine to use for this data?

2 answers

Using a search engine to search is IMO the best option.

Elasticsearch of course!

Disclaimer: I work at Elastic. :)

The answer is, of course, "it depends". In your example, you're counting the number of records in the database with a given ID. I find it hard to believe that it would take 30 seconds in MySQL, unless you're on some sluggish laptop.
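To make that point concrete, here is a minimal sketch of how one might check whether a lookup by ID actually uses an index. It uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption made only so the example runs anywhere; the table name `loans`, the index name `idx_loans_id`, and the sample rows are hypothetical. In MySQL the comparable check would be to prefix the query with `EXPLAIN`.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loans ("
    " id INTEGER, date TEXT, owner TEXT,"
    " deadline TEXT, checkout_date TEXT, checkin_date TEXT)"
)

# Hypothetical sample data: 10,000 rows spread over 1,000 distinct IDs.
conn.executemany(
    "INSERT INTO loans VALUES (?, ?, ?, ?, ?, ?)",
    [
        (i % 1000, "2017-01-01", f"user{i % 50}",
         "2017-02-01", "2017-01-02", "2017-01-20")
        for i in range(10_000)
    ],
)


def query_plan(connection, sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT * FROM loans WHERE id = ?"

# Without an index, the planner has to scan the whole table.
plan_before = query_plan(conn, lookup, (42,))

# A single-column index on the lookup column is enough for this query.
conn.execute("CREATE INDEX idx_loans_id ON loans (id)")
plan_after = query_plan(conn, lookup, (42,))

matches = conn.execute(
    "SELECT COUNT(*) FROM loans WHERE id = ?", (42,)
).fetchone()[0]

print("before index:", plan_before)
print("after index: ", plan_after)
print("rows with id 42:", matches)
```

If the plan after creating the index still reports a full scan, the index is probably not on the column used in the WHERE clause. Note that this sketch indexes only the lookup column rather than every column; whether that is the right choice for the real table depends on which queries the analysis needs.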

MySQL has powered an incredible number of systems because it is full-featured, stable, and has pretty good performance. It's bad (or has been bad) at some things, like text search, clustering, etc.

Systems like Elasticsearch are good with globs of text, but still may not be a good fit for your system, depending on usage. From your schema, you have one text field ("owner"), and you wouldn't need Elasticsearch's text searching capabilities on a field like that (who ever needed to stem a user name?). Elasticsearch is also used widely for log files, which also don't need a text engine. It is, however, good with blocks of text and with clustering.

If this is a class assignment, I'd stick with MySQL.