
How can I optimize this stored procedure?

I need some help optimizing this procedure:

DELIMITER $$

CREATE DEFINER=`ryan`@`%` PROCEDURE `GetCitiesInRadius`(
    cityID  numeric (15), 
    `range`  numeric (15)
)
BEGIN 
    DECLARE lat1  decimal (5,2);
    DECLARE long1  decimal (5,2);
    DECLARE rangeFactor  decimal (7,6);
    SET rangeFactor = 0.014457;
    SELECT `latitude`,`longitude` into  lat1,long1
    FROM  world_cities as wc WHERE city_id = cityID;

    SELECT 
        wc.city_id, 
        wc.accent_city as city, 
        s.state_name as state, 
        c.short_name as country,
        GetDistance(lat1, long1, wc.`latitude`, wc.`longitude`) as dist
        FROM  world_cities as wc
        left join states s on wc.state_id = s.state_id
        left join countries c on wc.country_id = c.country_id
        WHERE
        wc.`latitude` BETWEEN lat1 -(`range` * rangeFactor) AND lat1 + (`range` * rangeFactor)
        AND wc.`longitude` BETWEEN long1 - (`range` * rangeFactor) AND long1 + (`range` * rangeFactor)
        AND GetDistance(lat1, long1, wc.`latitude`, wc.`longitude`) <= `range`
        ORDER BY dist limit 6;
END

Here is my explain on the main portion of the query:

+----+-------------+-------+--------+---------------+--------------+---------+--------------------------+------+----------------------------------------------+
| id | select_type | table | type   | possible_keys | key          | key_len | ref                      | rows | Extra                                        |
+----+-------------+-------+--------+---------------+--------------+---------+--------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | B     | range  | idx_lat_long  | idx_lat_long | 12      | NULL                     | 7619 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | s     | eq_ref | PRIMARY       | PRIMARY      | 4       | civilipedia.B.state_id   |    1 |                                              |
|  1 | SIMPLE      | c     | eq_ref | PRIMARY       | PRIMARY      | 1       | civilipedia.B.country_id |    1 | Using where                                  |
+----+-------------+-------+--------+---------------+--------------+---------+--------------------------+------+----------------------------------------------+
3 rows in set (0.00 sec)

Here are the indexes:

mysql> show indexes from world_cities;
+--------------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table        | Non_unique | Key_name      | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| world_cities |          0 | PRIMARY       |            1 | city_id     | A         |     3173958 |     NULL | NULL   |      | BTREE      |         |
| world_cities |          1 | country_id    |            1 | country_id  | A         |       23510 |     NULL | NULL   | YES  | BTREE      |         |
| world_cities |          1 | city          |            1 | city        | A         |     3173958 |     NULL | NULL   | YES  | BTREE      |         |
| world_cities |          1 | accent_city   |            1 | accent_city | A         |     3173958 |     NULL | NULL   | YES  | BTREE      |         |
| world_cities |          1 | idx_pop       |            1 | population  | A         |       28854 |     NULL | NULL   | YES  | BTREE      |         |
| world_cities |          1 | idx_lat_long  |            1 | latitude    | A         |     1057986 |     NULL | NULL   | YES  | BTREE      |         |
| world_cities |          1 | idx_lat_long  |            2 | longitude   | A         |     3173958 |     NULL | NULL   | YES  | BTREE      |         |
| world_cities |          1 | accent_city_2 |            1 | accent_city | NULL      |     1586979 |     NULL | NULL   | YES  | FULLTEXT   |         |
+--------------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
8 rows in set (0.01 sec)

I wouldn't think the function used in the query would cause the slowdown, but here it is:

CREATE DEFINER=`ryan`@`%` FUNCTION `GetDistance`(lat1  numeric (9,6),
    lon1  numeric (9,6), 
    lat2  numeric (9,6),
    lon2  numeric (9,6)  ) RETURNS decimal(10,5)
BEGIN 
    DECLARE  x  decimal (20,10);
    DECLARE  pi  decimal (21,20); 
    SET  pi = 3.14159265358979323846; 
    SET  x = sin( lat1 * pi/180 ) * sin( lat2 * pi/180  ) + cos( 
        lat1 *pi/180 ) * cos( lat2 * pi/180 ) * cos( (lon2 * pi/180) -
        (lon1 *pi/180)
    );
    SET  x = atan( ( sqrt( 1- power( x, 2 ) ) ) / x );
    RETURN  ( 1.852 * 60.0 * ((x/pi)*180) ) / 1.609344;
END

As far as I can tell there is nothing directly wrong with your logic that would make this slow, so the problem ends up being that you can't use indexes effectively with this query.

Without a usable index MySQL would need a full table scan, applying the functions in your WHERE clause to each row to determine whether it passes the conditions. Currently one index is used: idx_lat_long.

It's a bit of a weak index: the longitude portion will never be used, because the latitude portion is a floating-point value that is always matched with a range, never with equality. At the very least you managed to filter out all rows that are outside the latitude range, but what's left (7619 rows in your EXPLAIN) is likely still a lot.
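For illustration, a query shaped like the inner part of the procedure (with hypothetical literal values) can only use the latitude column of idx_lat_long for the range scan; the longitude condition is checked against every row that scan returns:

-- Hypothetical bounding box; only the first column of idx_lat_long
-- (latitude) can narrow the scan, the longitude predicate is
-- evaluated per row afterwards.
SELECT city_id
FROM world_cities
WHERE latitude  BETWEEN 40.0 AND 41.5
  AND longitude BETWEEN -75.0 AND -73.5;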

You'd actually get slightly better results with longitude first, because humans only really live in the middle 30% of the earth: we're very much spread out horizontally, but not really vertically.

Regardless, the best way to narrow things down further is to filter out as many records outside the general area as possible. Right now the index gives you a full vertical strip of the earth; try to make it a bounding box.

You could naively dice up the earth into, say, 10x10 segments. In the best case this would make sure the query is limited to 10% of the earth ;).
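A minimal sketch of that idea, assuming two hypothetical segment columns derived from latitude and longitude (the column names and the 10x10 split are mine, not part of the existing schema):

-- Hypothetical: two segment columns (0..9 each) plus a composite index.
ALTER TABLE world_cities
  ADD COLUMN lat_segment TINYINT UNSIGNED,
  ADD COLUMN lng_segment TINYINT UNSIGNED;

UPDATE world_cities
SET lat_segment = LEAST(FLOOR((latitude  + 90)  / 18), 9),
    lng_segment = LEAST(FLOOR((longitude + 180) / 36), 9);

CREATE INDEX idx_segments ON world_cities (lat_segment, lng_segment);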

But as soon as your bounding box spans two separate segments, only the first coordinate (lat or lng) can be used in the index and you end up with the same problem.

So when I hit this problem I started thinking about it differently. Instead, I divided up the earth into 4 segments (let's say north-east, north-west, south-east, south-west on a map). This gives me coordinates like:

  • 0,0
  • 0,1
  • 1,0
  • 1,1

Instead of putting the x and y values in 2 separate fields, I used a bit field and stored both at once.

Then I divided each of the 4 boxes up again, which gives us 2 sets of coordinates: the outer and the inner ones. I'm still encoding this in the same field, which means we now use 4 bits for our 4x4 coordinate system.
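As a quick worked example (assuming the same bit layout as the function further down, where the y/latitude bits take the odd positions and the x/longitude bits the even ones): the 4x4 cell x = 2 (binary 10), y = 1 (binary 01) interleaves to binary 0110, i.e. 6.

-- Interleave x = 2 (binary 10) and y = 1 (binary 01) on the 4x4 grid.
-- Resulting bit layout: y1 x1 y0 x0 = 0 1 1 0 = 6
SELECT (((1 >> 1) & 1) << 3)   -- y bit 1 -> position 3
     | (((2 >> 1) & 1) << 2)   -- x bit 1 -> position 2
     | ((1 & 1) << 1)          -- y bit 0 -> position 1
     | (2 & 1)                 -- x bit 0 -> position 0
       AS interleaved;         -- returns 6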

How far can we go? If we assume a 64-bit integer field, 32 bits can be used for each of the 2 coordinates. This gives us a grid of 4294967295 x 4294967295 all encoded into one database field.

The beauty of this field is that you can index it. This is sometimes called (I believe) a quad-tree. If you need to select a big area in your database, you just calculate the 64-bit numbers for two opposite corners of the box (in the 4294967295 x 4294967295 grid system): the corner with the smallest coordinates and the one with the largest. It's guaranteed that anything that lies in that box will also lie between those two numbers.

How do you get to those numbers? Let's be lazy and assume that both our x and y coordinates range from -180 to 180 degrees. (The y coordinate is of course only half that, but we're lazy.)

First we make it positive:

// assuming x and y are our long and lat
x += 180;
y += 180;

So the max for those is 360 now, and 4294967295 / 360 is around 11930464.

So to convert to our new grid system, we just do:

x *= 11930464;
y *= 11930464;

Now we have two distinct numbers, and we need to turn them into one number by interleaving the bits: first bit 1 of x, then bit 1 of y, then bit 2 of x, bit 2 of y, and so on.

// The 'morton number' 
morton = 0
// The current bit we're interleaving
bit = 1
// The position of the bit we're interleaving
position = 0

while(bit <= latitude or bit <= longitude) {

  if (bit & latitude) morton = morton | 1 << (2*position+1)
  if (bit & longitude) morton = morton | 1 << (2*position)

  position += 1
  bit = 1 << position

}

I'm calling the final variable 'morton', after the guy who came up with this encoding in 1966.

So this leaves us finally with the following:

  1. For each row in your database, calculate the morton number and store it.
  2. Whenever you do a query, first determine the maximum bounding box (as the morton number) and filter on that.

This will greatly reduce the number of records you need to check.

Here's a stored function I wrote that will do the calculation for you:

CREATE FUNCTION getGeoMorton(lat DOUBLE, lng DOUBLE) RETURNS BIGINT UNSIGNED DETERMINISTIC 
BEGIN

  -- 11930464 is round(maximum value of a 32bit integer / 360 degrees) 

  DECLARE bit, morton, pos BIGINT UNSIGNED DEFAULT 0;  

  SET @lat = CAST((lat + 90) * 11930464 AS UNSIGNED);
  SET @lng = CAST((lng + 180) * 11930464 AS UNSIGNED);
  SET bit = 1;

  WHILE bit <= @lat || bit <= @lng DO 

    IF(bit & @lat) THEN SET morton = morton | ( 1 << (2 * pos + 1)); END IF;
    IF(bit & @lng) THEN SET morton = morton | ( 1 << (2 * pos)); END IF;

    SET pos = pos + 1;

    SET bit = 1 << pos;

  END WHILE; 

  RETURN morton;
END;
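To tie it together, here is a rough sketch of how the new column could be populated and used in a query like the one from the question (the column and index names are assumptions, and lat1, long1, `range` and rangeFactor are the variables from the original procedure):

-- Store the morton number for every city and index it (hypothetical names).
ALTER TABLE world_cities ADD COLUMN morton BIGINT UNSIGNED;
UPDATE world_cities SET morton = getGeoMorton(latitude, longitude);
CREATE INDEX idx_morton ON world_cities (morton);

-- Rough filter on the morton range of the bounding box first,
-- then the exact distance check on the remaining rows.
SELECT wc.city_id,
       GetDistance(lat1, long1, wc.latitude, wc.longitude) AS dist
FROM world_cities AS wc
WHERE wc.morton BETWEEN getGeoMorton(lat1 - (`range` * rangeFactor),
                                     long1 - (`range` * rangeFactor))
                    AND getGeoMorton(lat1 + (`range` * rangeFactor),
                                     long1 + (`range` * rangeFactor))
  AND GetDistance(lat1, long1, wc.latitude, wc.longitude) <= `range`
ORDER BY dist
LIMIT 6;

Because the morton range of a box is a superset of the box itself, rows outside the radius can still slip through the rough filter; the GetDistance check removes them.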

A few caveats:

  1. The absolute worst-case scenario will still scan 50% of your entire table. The chance of hitting it is extremely low though, and I've seen significant performance increases for most real-world queries.
  2. The bounding box in this case assumes a Euclidean space, meaning a flat surface. In reality your bounding boxes are not exact squares, and they warp heavily closer to the poles. By just making the boxes a bit larger (depending on how exact you want to be) you can get quite far. Most real-world data is not close to the poles anyway ;). Remember that this is just a rough filter to get most of the likely unwanted rows out.
  3. This is based on the so-called Z-order curve. To get even better performance, if you're feeling adventurous, you could try the Hilbert curve instead. That curve rotates at each level of recursion, which ensures that in a worst-case scenario you will only scan about 25% of the table. Magic! In general it also filters out far more unwanted rows.

Source for all this: I wrote 3 blog posts about this topic when I ran into the same problems and tried to find a creative solution. I got much better performance with this approach than with MySQL's GEO indexes.
