How does a B+tree handle a combination of AND, OR, IN and equals?

How do these 4 types of queries take advantage of indexes? What does the scan look like?

WHERE status = "foo"

WHERE id IN (1, 2, 3)

WHERE id IN (1, 2, 3) AND status = "foo"

WHERE id IN (1, 2, 3) OR status = "foo"

In the first case, I think this is a B+tree with the key being the status. Easy enough. But wait, it needs to store multiple items per status, so maybe it has an array (generally speaking) of the records for each status.

But for the second query, it seems you would just have the index be on id and fetch each key from the B+tree one id at a time, so it would do tree.get(id) for each id. But that already seems less than ideal. How is it actually done?

Then take it further and combine the two query types: you can only use one of the indexes now (say the id index, not the status index). Then you get the subset of records matching these IDs, iterate through them, and check the status.

Now this is starting to seem really inefficient.

Same with the OR query.

How are these typically implemented in a database, generally or ideally speaking?

I am asking because I would like to implement a basic version of this in JavaScript for the browser. Basically, what is the best way to have multiple (potentially multi-column) indexes on a table? So I can store a record in this "table", it gets stored in every index, and then a query fetches from the "best" index. I am not really sure how this works at a high level (high level, yet very deep in terms of data-structure/algorithm implementation) to get started.

This is the template I am basically starting with:

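class Index {
  constructor(fields = ['id']) {
    this.fields = fields
    // Tree is the B+tree referred to below; it is not defined in this snippet.
    this.tree = new Tree()
  }

  insert(record) {
    // Store the record under its index key.
    this.tree.insert(this.getKey(record), record)
  }

  remove(record) {
    this.tree.remove(this.getKey(record))
  }

  check(record) {
    return this.tree.check(this.getKey(record))
  }

  getKey(record) {
    // Join with a separator so that ('ab', 'c') and ('a', 'bc')
    // do not collapse into the same key.
    return this.fields.map(field => record[field]).join('\u0000')
  }
}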

class Table {
  constructor() {
    this.index = []
  }

  insert(record) {
    this.index.forEach(index => index.insert(record))
  }

  select(query) {
    // query processing
  }

  remove(id) {
    // not implemented yet
  }
}

So basically, for each table you create several indexes. When you insert a record, it gets the key for each index and inserts it into a Tree (the B+tree that acts like a key/value store). From there I don't know how to properly use the indexes, or if I'm even on the right track. I would ask how an ideal relational database would implement this, but that would likely get downvoted as being too general :/ but that's what I'm actually trying to build.

I have this B+tree as an example to work with.

You don't seem to be restricted in the indexes you can have, so let's assume you have an index on (id) and an index on (status, id). I'm also going to assume that id is a primary key or has a uniqueness constraint, as IDs usually do:

WHERE status = "foo"

The range of items that match the status is efficiently read out of the (status, id) index.

WHERE id IN (1, 2, 3)

Assuming id is an integral type, the range of items with id >= 1 and <= 3 is read out of the (id) index. The index is ordered, and finding a range of consecutive values is no more difficult than finding a single value.

WHERE id IN (1, 2, 3) AND status = "foo"

This matches a consecutive range in the (status, id) index.

WHERE id IN (1, 2, 3) OR status = "foo"

The (1,2,3) range is selected from the (id) index and the "foo" range is selected from the (status, id) index. The results are then merged. Since both ranges have distinct rows in the same order (by id), they can be merged efficiently like the merge operation in merge sort, with a row that appears in both ranges emitted only once.
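
The merge step for the OR query could look roughly like the following. This is only a minimal sketch: it assumes each index scan has already produced an array of ids in ascending order, and `mergeUnion` and the sample arrays are hypothetical names and data, not part of any particular database.

```javascript
// Union of two ascending id lists, as in the merge step of merge sort.
// An id present in both inputs is emitted once.
function mergeUnion(a, b) {
  const out = []
  let i = 0
  let j = 0
  while (i < a.length && j < b.length) {
    if (a[i] < b[j]) out.push(a[i++])
    else if (a[i] > b[j]) out.push(b[j++])
    else {
      out.push(a[i]) // the same row matched both predicates
      i++
      j++
    }
  }
  while (i < a.length) out.push(a[i++])
  while (j < b.length) out.push(b[j++])
  return out
}

// Hypothetical scan results: ids read from the (id) index for IN (1, 2, 3),
// and ids read from the (status, id) index for status = "foo".
const fromIdIndex = [1, 2, 3]
const fromStatusIndex = [2, 5, 9]
console.log(mergeUnion(fromIdIndex, fromStatusIndex)) // [ 1, 2, 3, 5, 9 ]
```

Each input is read once from front to back, so the cost is linear in the number of matching rows and no sort or hash set is needed.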


If you want to be able to do the same sorts of things with your own index class, you need to support indexes on multiple columns, and you need to be able to get an iterator for the rows in the index, starting at a given key.
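
To make those two requirements concrete, here is a hedged sketch in JavaScript. A sorted array stands in for the B+tree (a real tree would give the same ordering with cheaper inserts), and `SortedIndex`, `compareKeys`, `from` and `withPrefix` are illustrative names, not an existing API. It also assumes the values in a given column are all of one comparable type.

```javascript
// Compare two composite keys (arrays) column by column.
function compareKeys(a, b) {
  const n = Math.min(a.length, b.length)
  for (let i = 0; i < n; i++) {
    if (a[i] < b[i]) return -1
    if (a[i] > b[i]) return 1
  }
  return a.length - b.length // a shorter key sorts before its extensions
}

// Stand-in for a B+tree: a sorted array of { key, record } entries.
class SortedIndex {
  constructor(fields) {
    this.fields = fields
    this.entries = []
  }

  keyOf(record) {
    return this.fields.map(field => record[field])
  }

  // Position of the first entry whose key is >= the given key.
  lowerBound(key) {
    let lo = 0
    let hi = this.entries.length
    while (lo < hi) {
      const mid = (lo + hi) >> 1
      if (compareKeys(this.entries[mid].key, key) < 0) lo = mid + 1
      else hi = mid
    }
    return lo
  }

  insert(record) {
    const key = this.keyOf(record)
    this.entries.splice(this.lowerBound(key), 0, { key, record })
  }

  // Iterate records in key order, starting at the first key >= `key`.
  *from(key) {
    for (let i = this.lowerBound(key); i < this.entries.length; i++) {
      yield this.entries[i].record
    }
  }

  // Records whose key starts with `prefix`, e.g. ['foo'] on (status, id).
  *withPrefix(prefix) {
    for (let i = this.lowerBound(prefix); i < this.entries.length; i++) {
      const { key, record } = this.entries[i]
      if (compareKeys(key.slice(0, prefix.length), prefix) !== 0) return
      yield record
    }
  }
}

const byStatusId = new SortedIndex(['status', 'id'])
const rows = [
  { id: 1, status: 'foo' },
  { id: 2, status: 'bar' },
  { id: 3, status: 'foo' },
  { id: 4, status: 'baz' },
]
for (const row of rows) byStatusId.insert(row)

// WHERE status = "foo": one seek, then a sequential read.
console.log([...byStatusId.withPrefix(['foo'])].map(r => r.id)) // [ 1, 3 ]
```

The point of `withPrefix` is that an equality on the leading column of a multi-column index is just a range scan: seek to the first matching key, then read forward until the prefix changes.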

I'll address this specifically for MySQL/MariaDB. The specifics may vary with other vendors. I have changed away from "1,2,3" to avoid the temptation to assume the values are consecutive. I am also changing away from "id" because id is the PRIMARY KEY.

MySQL will use a B+Tree.

WHERE status = "foo"
    INDEX(status)       -- best
    INDEX(status, ...)  -- nearly as good
    If a nontrivial number of rows have "foo", it won't bother using any index!

WHERE bar IN (123, 456, 789)
    INDEX(bar)  -- It will do multiple BTree lookups.

WHERE bar IN (123, 456, 789) AND status = "foo"
    INDEX(status, bar)   -- In this order

WHERE bar IN (123, 456, 789) OR status = "foo"
    No index is likely to be beneficial; it will do a table scan.
    It would probably run faster to use two SELECTs and a UNION

If you need to do all 4 queries, then I recommend having these two indexes:

    INDEX(status, bar)  -- helps 1st and 3rd
    INDEX(bar)          -- helps 2nd
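
For a home-grown table class, one simple way to pick the "best" index for an AND-only query is to prefer the index whose leading columns are all constrained by `=` or `IN`. The sketch below is a rough heuristic under that assumption, not how MySQL's cost-based optimizer actually decides; `usablePrefix` and `chooseIndex` are hypothetical helper names, and OR queries are not handled.

```javascript
// Count how many leading columns of an index the query constrains.
function usablePrefix(indexFields, constrained) {
  let n = 0
  while (n < indexFields.length && constrained.has(indexFields[n])) n++
  return n
}

// Pick the index with the longest usable prefix, or null for a full scan.
function chooseIndex(indexes, constrainedColumns) {
  const constrained = new Set(constrainedColumns)
  let best = null
  let bestLen = 0
  for (const fields of indexes) {
    const len = usablePrefix(fields, constrained)
    if (len > bestLen) {
      best = fields
      bestLen = len
    }
  }
  return best
}

// The two indexes recommended above, as lists of column names.
const indexes = [['status', 'bar'], ['bar']]
console.log(chooseIndex(indexes, ['status']))        // [ 'status', 'bar' ]
console.log(chooseIndex(indexes, ['bar']))           // [ 'bar' ]
console.log(chooseIndex(indexes, ['bar', 'status'])) // [ 'status', 'bar' ]
```

With these two indexes the heuristic reproduces the comments above: (status, bar) serves the 1st and 3rd queries, and (bar) serves the 2nd.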

Think of concatenating the columns, then using that as a single key into the BTree. (This will keep you from getting distracted by "cardinality" or "selectivity" of the individual columns.)
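
In JavaScript, taking "concatenate the columns" literally has a pitfall, which the short sketch below illustrates. It assumes keys are kept as tuples and compared column by column, which orders them the way one concatenated key would; `compareTuples` and the sample keys are made up for illustration.

```javascript
// Naive string concatenation loses the column boundary:
console.log(['ab', 'c'].join('') === ['a', 'bc'].join('')) // true

// Keeping the columns as a tuple and comparing them left to right
// behaves like one concatenated key without that ambiguity.
function compareTuples(a, b) {
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] < b[i]) return -1
    if (a[i] > b[i]) return 1
  }
  return a.length - b.length
}

// Keys of a hypothetical INDEX(status, bar), in arbitrary insertion order.
const keys = [
  ['foo', 789],
  ['bar', 456],
  ['foo', 123],
  ['baz', 123],
  ['foo', 456],
]
keys.sort(compareTuples)
console.log(keys.map(k => k.join(':')))
// [ 'bar:456', 'baz:123', 'foo:123', 'foo:456', 'foo:789' ]
```

After sorting, all "foo" keys sit next to each other and are ordered by bar within that group, which is why INDEX(status, bar) suits `status = "foo" AND bar IN (...)`.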

This does not get into "clustering" and "index merge" and many other topics.
