What are some ways to optimize complex math queries in MySQL?

Question

I've been researching this forever and I'm starting to think I can't make it any more efficient, but I wanted to ask whether anyone has any tips.

I'm running a query on millions of records to find all the x,y,z coordinates (these are stars) that lie within a column of a given radius running in a straight line from system a to system b. I'm running it through PHP, with a lot of other work being done on the result set. The script returns results in about 16 seconds, and the query accounts for about 7 of those 16 seconds.

The basic query logic is:

SELECT name, coordinates, and distance from end point
FROM stars
WHERE all stars are in a column of given radius between start and end points
ORDER BY distance from end point DESC

The WHERE clause requires two separate calculations:

WHERE calculation 1:

Calculate if the stars are within the space of the column using constants and x,y,z

WHERE calculation 2:

Limit the column radius to a given figure.
(This where clause also performs similar calculations with the same constants and x,y,z.)

The math formulas in the WHERE clause can't really be changed; they are the formulas needed for a columnar calculation in 3D space.

The ORDER BY at the end of the query is absolutely necessary because the result set is too large for my script to hold in memory, so I have to work through it in the proper order in the script.

The query is easiest to read as defined prior to variable substitution:

SELECT
    name,
    x,
    y,
    z,
    SQRT(
        pow(`x`-" . $bx . ",2)+
        pow(`y`-" . $by . ",2)+
        pow(`z`-" . $bz . ",2)
    ) d
FROM
    stars
WHERE
    (((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) between 0 and 1
AND
    SQRT(((($ax + ((((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) * $cx))-`x`)*(($ax + ((((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) * $cx))-`x`))+((($ay + ((((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) * $cy))-`y`)*(($ay + ((((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) * $cy))-`y`))+((($az + ((((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) * $cz))-`z`)*(($az + ((((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) * $cz))-`z`)))
        <=$radius
ORDER BY
    SQRT(
        pow(`x`-" . $bx . ",2)+
        pow(`y`-" . $by . ",2)+
        pow(`z`-" . $bz . ",2)
    ) DESC

The final query run on the database looks like this. (For simplicity, I'm using sample data where a lot of the constants are 0.)

SELECT
    name, 
    x, 
    y, 
    z, 
    SQRT( pow(`x`-25.21875,2)+ pow(`y`--20.90625,2)+ pow(`z`-25899.96875,2) ) d
FROM
    stars
WHERE
    (((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) 
    between 0 and 1
AND
    SQRT((((0 + ((((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) * 25.21875))-`x`)*((0 + ((((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) * 25.21875))-`x`))+(((0 + ((((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) * -20.90625))-`y`)*((0 + ((((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) * -20.90625))-`y`))+(((0 + ((((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) * 25899.96875))-`z`)*((0 + ((((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) * 25899.96875))-`z`)))
    <=600
ORDER BY
    SQRT( pow(`x`-25.21875,2)+ pow(`y`--20.90625,2)+ pow(`z`-25899.96875,2) ) DESC

My table definition looks like this:

CREATE TABLE IF NOT EXISTS `stars` (
    `localkey` bigint(20) NOT NULL AUTO_INCREMENT,
    `id` bigint(20) NOT NULL,
    `x` double NOT NULL,
    `y` double NOT NULL,
    `z` double NOT NULL,
    `name` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`localkey`),
UNIQUE KEY `id` (`id`),
KEY `x` (`x`),
KEY `y` (`y`),
KEY `z` (`z`),
KEY `xyz` (`x`,`y`,`z`),
KEY `name` (`name`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

The EXPLAIN output for the query indicates no index usage and an Extra of:

extra: Using where; Using filesort;

What I've tried so far:

Adjusting various data types to optimize memory usage and indexing (even though my math makes it unlikely that indexes will ever be used).
Using a PHP loop and multiple smaller queries instead of this one huge one. (It took longer with multiple queries.)
Copying to a memory table before running the query. (The table is too large to fit in memory.)
Copying only part of the table (localkey, x, y, z) to memory. (It fit, but left so little of max_heap_table_size for other processes that it wasn't worth it.)

Are there other options I'm missing?

Thanks!

Answer 1

Assuming that only a small subset of your records will match, you can reduce the math load by doing a basic "rectangular" filtering first. For example, there's no point in computing the full Cartesian distance for EVERY record in the table, only to throw most of them away.

A "box" boundary check is just a subtraction and a comparison per axis:

SELECT ...
FROM (
    SELECT ...
    WHERE (
        (abs($x_coord - x_coordinate) <= $max_distance)
    AND (abs($y_coord - y_coordinate) <= $max_distance)
    )
) AS square_filter
WHERE ... full calculation here

Of course, you're working with 3D positions, so it's a bit more complicated, but this should give you the basic idea.

Answer 2

In addition to the excellent fast pre-filtering that Marc B suggested above, when you do your second pass you might save a little computation in the distance formula in two ways:

1) Use (x - k) * (x - k) instead of calling pow(x - k, ...).

2) Skip the square root and calculate the squared distance. You would then compare it to the square of the distance you need, which only has to be calculated once.
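As a rough sketch of both points, here is how the distance-to-end-point column and a radius test might look with plain multiplication and no square root. It uses the sample end point and radius from the question; the alias d2 is made up for illustration, and the WHERE clause is a simplified sphere test around the end point, not the question's column formula. Because the square root is monotonic, sorting by the squared distance gives the same order as sorting by the distance itself.

```sql
-- Sample end point b = (25.21875, -20.90625, 25899.96875) and radius 600
-- are taken from the question; d2 is a hypothetical alias.
SELECT
    name, x, y, z,
    (x - 25.21875)    * (x - 25.21875)
  + (y + 20.90625)    * (y + 20.90625)
  + (z - 25899.96875) * (z - 25899.96875) AS d2   -- squared distance to b
FROM stars
WHERE
    (x - 25.21875)    * (x - 25.21875)
  + (y + 20.90625)    * (y + 20.90625)
  + (z - 25899.96875) * (z - 25899.96875) <= 600 * 600   -- compare against the squared radius
ORDER BY d2 DESC;
```

The same change applies to the column radius test in the question: drop the outer SQRT and compare against 600 * 600 instead of 600. If the script needs the real distance, it can take the square root of d2 for the rows it actually receives.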

Answer 3

Along with the two great mathematical optimizations suggested, the biggest jump in speed would come from minimizing computation by reducing your search space. That means spatial indexing.

I am not an expert in MySQL, but the idea is that you pre-generate spatial indexes in 3D space so your search space is dramatically reduced.

To be more precise, with a full table scan the cost is O(n): every row is read and tested, so the time required grows with the size of the table. With a tree-based spatial index, locating the candidate rows can be reduced to roughly O(log n), plus the cost of reading the matches.

Think of this as dividing the space into fixed-size cubes (and cubes within cubes), not unlike how Google Maps arranges tiles. With the index you have a "wormhole" to each cube: from the initial coordinates you can calculate which cube a star falls in in O(1) time. Then all you have to do is run the search within that cube.
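One way to approximate the cube idea in plain MySQL, without spatial types, is to store an integer cell number per axis and index those. This is only a sketch: the column names cell_x, cell_y, cell_z, the index name cell, and the cube size of 1000 units are all assumptions rather than anything from the question, and whether it helps depends on how the stars are distributed.

```sql
-- Hypothetical grid columns; 1000 is an arbitrary cube edge length.
ALTER TABLE stars
    ADD COLUMN cell_x INT NOT NULL DEFAULT 0,
    ADD COLUMN cell_y INT NOT NULL DEFAULT 0,
    ADD COLUMN cell_z INT NOT NULL DEFAULT 0;

-- Assign every star to the cube that contains it.
UPDATE stars
SET cell_x = FLOOR(x / 1000),
    cell_y = FLOOR(y / 1000),
    cell_z = FLOOR(z / 1000);

ALTER TABLE stars ADD KEY cell (cell_x, cell_y, cell_z);

-- Candidate stars for the sample column: every cube touched by the box
-- that spans both end points, padded by the radius of 600.
--   x: -600 .. 625.21875      -> cells -1 .. 0
--   y: -620.90625 .. 600      -> cells -1 .. 0
--   z: -600 .. 26499.96875    -> cells -1 .. 26
SELECT name, x, y, z
FROM stars
WHERE cell_x IN (-1, 0)
  AND cell_y IN (-1, 0)
  AND cell_z BETWEEN -1 AND 26;
```

The exact column checks from the question would then be appended with AND, so they only run on the candidate rows.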

Here's some reference in the MySQL docs on spatial indexes.

I faced a similar problem dealing with LiDAR data in two coordinates a few years back. Here's the link to my question and its answers, which may help you get some ideas:

https://gis.stackexchange.com/questions/12030/optimize-nearest-neighbor-query-on-70-million-point-cloud-on-sql-server-2008

Answer 4

The bounding-box approach might even use an index on one of the dimensions, but not if written the way Marc suggests. Instead:

`x` BETWEEN $x - $dist AND $x + $dist

The general principle is that you should not hide an indexed column inside a function call (ABS in this example).
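Extended to three axes for the sample data in the question, that could look like the sketch below. The bounds are the smaller and larger coordinate of the two end points on each axis, widened by the radius of 600. That box is deliberately loose, so it only narrows the candidates and the exact checks are still needed. Running it under EXPLAIN shows whether the optimizer actually picks one of the existing indexes.

```sql
-- Start point (0, 0, 0), end point (25.21875, -20.90625, 25899.96875),
-- radius 600: all taken from the sample query in the question.
EXPLAIN
SELECT name, x, y, z
FROM stars
WHERE x BETWEEN -600       AND 625.21875     -- LEAST(0, 25.21875) - 600 .. GREATEST(0, 25.21875) + 600
  AND y BETWEEN -620.90625 AND 600           -- LEAST(0, -20.90625) - 600 .. GREATEST(0, -20.90625) + 600
  AND z BETWEEN -600       AND 26499.96875;  -- LEAST(0, 25899.96875) - 600 .. GREATEST(0, 25899.96875) + 600
```

If the key column of the EXPLAIN output names x, y, or xyz, the two exact predicates from the question can be appended with AND so that they run only on rows inside the box.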

Also...

ORDER BY d -- this will avoid recomputing the SQRT

Does the double minus in pow(y--20.90625,2) really work? To avoid it, swap the operands:

pow(-20.90625 - `y`,2)

It is messier to set up, but multiplication might be faster than POW:

(-20.90625 - `y`) * (-20.90625 - `y`)
