
Store data in the file with two indexes

I am looking for a way to store a large amount of data in a file or files. There is an additional requirement: the data should be indexed, and two indexes on integer fields should allow a specific set of data to be selected very quickly.

Details: the data record is a fixed-length set of 3 integers like this:

A (int) | B (int) | N (int)

A and B are indexable columns, while N is just a data value.

This data set may contain billions of records (for example 30M), and there should be a way to select all records with A= as fast as possible. Or records with B= as fast as possible.

I cannot use any technologies other than MySQL and PHP. You might say: "Wow, you can use MySQL!" Sure, I am already using it, but because of MySQL's extra data, my database takes 10 times more space than it should, plus index data.

So I am looking for a file-based solution.

Are there any ready-made algorithms to implement this? Or a source code solution?

Thank you!

Update 1:

CREATE TABLE `w_vectors` (
    `wid` int(11) NOT NULL,
    `did` int(11) NOT NULL,
    `wn` int(11) NOT NULL DEFAULT '0',
    UNIQUE KEY `did_wn` (`did`,`wn`),
    KEY `wid` (`wid`),
    KEY `did` (`did`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci

Update 2:

The goal of this table is to store document-vs-words vectors for a word-based search application. This table stores all the words from all the documents in compact form (wid is the word ID from the word vocabulary, did is the document ID, and wn is the number of the word in the document). This works pretty well. However, if you have, let's say, 1,000,000 documents, and each document contains an average of 10k words, this table becomes very large: about 10 billion rows. With a row size of 34 bytes, it becomes a 340 GB structure for just 1 million documents. Not good, right?

I am looking for a way to optimize this.

If you must use MySQL, you could try:

  • Convert the table to MyISAM, which takes less space than InnoDB, and allows multiple indexes per table. I rarely recommend MyISAM because it doesn't support ACID properties. But if your alternative is a file-based solution, that won't support ACID either.

  • Use one of the various solutions for compressed data in MySQL. There's a nice comparison here: https://www.percona.com/blog/2018/11/23/compression-options-in-mysql-part-1/
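
As a rough sketch of what those two options might look like for the `w_vectors` table from the question (they are alternatives, not steps to run together). The `KEY_BLOCK_SIZE` value is only an illustrative assumption, and the compressed row format assumes `innodb_file_per_table` is enabled. On a table this size, either statement rebuilds the whole table, so it is worth trying on a copy first.

```sql
-- Option 1: switch the storage engine to MyISAM.
-- This gives up transactions and crash recovery.
ALTER TABLE w_vectors ENGINE=MyISAM;

-- Option 2: stay on InnoDB but store pages compressed.
-- Assumes innodb_file_per_table=ON; KEY_BLOCK_SIZE=8 is an illustrative value.
ALTER TABLE w_vectors ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
```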

You may as well change

UNIQUE KEY `did_wn` (`did`,`wn`)

to

PRIMARY KEY(did, wn)

and get rid of

INDEX(did)

since that composite index takes care of queries on did.

With that PK, these will be very efficient:
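
Put together, the change could be applied in a single statement along these lines. This is a sketch that assumes the index names from the `CREATE TABLE` in the question (`did_wn` and `did`); it rebuilds the table, which may take a long time with billions of rows.

```sql
-- Replace the UNIQUE key with a PRIMARY KEY and drop the redundant index on did.
-- Index names are taken from the CREATE TABLE shown in the question.
ALTER TABLE w_vectors
    DROP INDEX did_wn,
    DROP INDEX did,
    ADD PRIMARY KEY (did, wn);
```

In InnoDB the primary key is the clustered index, so the rows are then stored in (did, wn) order and no separate hidden row ID is needed.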

... WHERE did = 123
... WHERE did = 123 AND wn = 456
... WHERE wn = 456 AND did = 123

Meanwhile, your INDEX(wid) benefits any WHERE clause that tests for a single value of wid or a range of wids.

Since I don't know which columns your original A and B correspond to, I can't answer your question in terms of the real column names. Anyway:

"there should be a way to select all records with A= as fast as possible. Or records with B= as fast as possible."

For those, you need
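
One way to check that such queries actually use the index is to run them through EXPLAIN. The literal values below are hypothetical; in the output, the `key` column should name the `wid` index if it is being used.

```sql
-- Single value of wid (42 is a made-up word ID).
EXPLAIN SELECT did, wn FROM w_vectors WHERE wid = 42;

-- Range of wids (bounds are made-up values).
EXPLAIN SELECT did, wn FROM w_vectors WHERE wid BETWEEN 100 AND 200;
```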

INDEX(A)  -- or any index _starting with_ A
INDEX(B)  -- or any index _starting with_ B

But if either of those is did, don't add it. (The PK will take care of making it fast.)

Also, use InnoDB, not MyISAM. Alas, that leads to "10 times more space than it should" in your case. If you choose to use MyISAM, I will need to start over on index recommendations.

Once you map A and B to the column names, I'll give you one more tip.

More discussion of indexes: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
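
To see where the space actually goes, a query against `information_schema` along these lines may help. It is only a sketch: it assumes the current default database holds `w_vectors`, and for InnoDB the row count is an estimate.

```sql
-- For InnoDB, data_length covers the clustered index (PK plus row data),
-- and index_length covers the secondary indexes.
SELECT table_name,
       engine,
       table_rows,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'w_vectors';
```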

Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please cite this site's URL or the original address.

 