Optimizing MySQL query on large table

Question

I'm using MySQL with JDBC.

I have a large example table which contains 6.3 million rows that I am trying to perform efficient select queries on. See below:

I have created three additional indexes on the table, see below:

Performing a SELECT query like "SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN " + startTime + " AND " + endTime + " AND HourOfDay=4 AND DayOfWeek=3" takes 256356 ms, a little over four minutes, which is extremely slow. EXPLAIN on the same query gives me this:

My code for retrieving the data is below:

    Connection con = null;
    PreparedStatement pst = null;
    Statement stmt = null;
    ResultSet rs = null;

    String url = "jdbc:mysql://xxx.xxx.xxx.xx:3306/testdb";
    String user = "bigd";
    String password = "XXXXX";

    try {
        Class.forName("com.mysql.jdbc.Driver");
        con = DriverManager.getConnection(url, user, password);
        String query = "SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "+startTime+" AND "+endTime+" AND HourOfDay=4 AND DayOfWeek=3";
        stmt = con.prepareStatement("SELECT latitude, longitude FROM 3dag WHERE timestamp>=" + startTime + " AND timestamp<=" + endTime);
        stmt = con.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE);
        rs = stmt.executeQuery(query);

        System.out.println("Start");
        while (rs.next()) {

            int tempLong = (int) ((Double.parseDouble(rs.getString(2))) * 100000);
            int x = (int) (maxLong * 100000) - tempLong;
            int tempLat = (int) ((Double.parseDouble(rs.getString(1))) * 100000);
            int y = (int) (maxLat * 100000) - tempLat;

            if (!(y > matrix.length) || !(y < 0) || !(x > matrix[0].length) || !(x < 0)) {
                matrix[y][x] += 1;
            }
        }
        System.out.println("End");
        JSONObject obj = convertToCRS(matrix);
        return obj;

    }catch (ClassNotFoundException ex){
        Logger lgr = Logger.getLogger(Database.class.getName());
        lgr.log(Level.SEVERE, ex.getMessage(), ex);
        return null;
    }
    catch (SQLException ex) {
        Logger lgr = Logger.getLogger(Database.class.getName());
        lgr.log(Level.SEVERE, ex.getMessage(), ex);
        return null;
    } finally {
        try {
            if (rs != null) {
                rs.close();
            }
            if (pst != null) {
                pst.close();
            }
            if (con != null) {
                con.close();
            }
        } catch (SQLException ex) {
            Logger lgr = Logger.getLogger(Database.class.getName());
            lgr.log(Level.WARNING, ex.getMessage(), ex);
            return null;
        }
    }

Removing every line in the while(rs.next()) loop gives me the same horrible run-time.

My question is: what can I do to optimize this type of query? I am also curious about .setFetchSize() and what the optimal value would be here. The documentation says that Integer.MIN_VALUE results in fetching row by row. Is this correct?

Any help is appreciated.

EDIT: After creating a new index on timestamp, DayOfWeek and HourOfDay, my query runs 1 minute faster and EXPLAIN gives me this:

Answer 1

Some ideas up front:

Did you measure the SQL execution time alone (from .executeQuery() until the first row), or execution plus iteration over 6.3 million rows?
You prepare a PreparedStatement but never use it: the next line overwrites stmt with a plain Statement.
Use a PreparedStatement and pass timestamp, dayOfWeek and hourOfDay as parameters.
Create one index that can satisfy your WHERE condition. Order the keys so that the leading columns eliminate the most rows.

The index might look like:

CREATE INDEX stackoverflow on 3dag(hourOfDay, dayOfWeek, Timestamp);

Run your SQL directly in MySQL, without JDBC. What time do you get there?

Try without stmt.setFetchSize(Integer.MIN_VALUE); this might create many unneeded network roundtrips.
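To make the first three points concrete, here is a minimal sketch of a parameterized query that also times execution and iteration separately. It assumes timestamp is stored as an integer (the question concatenates it unquoted); the class name HeatmapQuery and the method timeQuery are made up for illustration. Keep in mind that without streaming, Connector/J buffers the whole result set during executeQuery(), so the first number will then include the transfer time.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class HeatmapQuery {
    // Table and column names come from the question. Equality columns are
    // listed first only to mirror the suggested index; the optimizer does not
    // care about the order of conditions in the WHERE clause.
    static final String SQL =
        "SELECT latitude, longitude FROM `3dag` "
      + "WHERE HourOfDay = ? AND DayOfWeek = ? "
      + "AND `timestamp` BETWEEN ? AND ?";

    /** Returns {ms until executeQuery() returned, ms spent iterating, row count}. */
    static long[] timeQuery(Connection con, int hourOfDay, int dayOfWeek,
                            long startTime, long endTime) throws SQLException {
        // try-with-resources closes the result set and the statement for us.
        try (PreparedStatement pst = con.prepareStatement(SQL)) {
            pst.setInt(1, hourOfDay);
            pst.setInt(2, dayOfWeek);
            pst.setLong(3, startTime); // assumption: integer timestamps
            pst.setLong(4, endTime);

            long t0 = System.nanoTime();
            try (ResultSet rs = pst.executeQuery()) {
                long t1 = System.nanoTime();
                long rows = 0;
                while (rs.next()) {
                    rows++; // no per-row work, so this measures fetching only
                }
                long t2 = System.nanoTime();
                return new long[] {
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, rows
                };
            }
        }
    }
}
```

If the first number dominates, the time is going into the query (or into buffering the result); if the second dominates, look at the fetch size and the network.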

Answer 2

According to your question, the cardinality of (that is, the number of distinct values in) your Timestamp column is about 1/30th of the cardinality of your Uid column. That is, you have lots and lots of identical timestamps. That doesn't bode well for the efficiency of your query.

That being said, you might try to use the following compound covering index to speed things up.

CREATE INDEX 3dag_q ON 3dag (`Timestamp`, HourOfDay, DayOfWeek, Latitude, Longitude)

Why will this help? Because your whole query can be satisfied from the index with a so-called tight index scan. The MySQL query engine will random-access the index to the entry with the smallest Timestamp value matching your query. It will then read the index in order and pull out the latitude and longitude from the rows that match.

You could try doing some of the summarizing on the MySQL server.

SELECT COUNT(*) number_of_duplicates, 
       ROUND(Latitude,4) Latitude, ROUND(Longitude,4) Longitude
  FROM 3dag
 WHERE timestamp BETWEEN "+startTime+" 
                     AND "+endTime+"
   AND HourOfDay=4
   AND DayOfWeek=3
 GROUP BY ROUND(Latitude,4), ROUND(Longitude,4)

This may return a smaller result set. Edit: This quantizes (rounds off) your lat/long values and then counts the number of items that become duplicates after rounding. The more coarsely you round (that is, the smaller the second argument in the ROUND(val,N) calls), the more duplicate values you will encounter, and the fewer distinct rows your query will return. Fewer rows save time.

Finally, if these lat/long values are GPS derived and recorded in degrees, it makes no sense to try to deal with more than about four or five decimal places. Commercial GPS precision is limited to that.
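The same quantization happens on the client side when the question multiplies by 100000 (five decimal places) to get a matrix cell. Below is a rough sketch of that binning step; GridBinner, cell and add are hypothetical names. Two things differ from the question's loop and are my own changes: it rounds instead of truncating, and it rejects a point when any index is out of range (the question's chain of negated comparisons joined with || is true for every point).

```java
final class GridBinner {
    // Five decimal places, matching the question's "* 100000".
    static final int SCALE = 100_000;

    /** Distance of value below max, measured in grid cells. */
    static int cell(double max, double value) {
        // Math.round avoids off-by-one cells caused by floating-point truncation.
        return (int) Math.round(max * SCALE) - (int) Math.round(value * SCALE);
    }

    /** Increments the cell for (lat, lon); returns false if it falls outside the matrix. */
    static boolean add(int[][] matrix, double maxLat, double maxLong,
                       double lat, double lon) {
        int y = cell(maxLat, lat);
        int x = cell(maxLong, lon);
        // Valid indexes run from 0 to length - 1, so ">= length" is out of range.
        if (y < 0 || y >= matrix.length || x < 0 || x >= matrix[0].length) {
            return false;
        }
        matrix[y][x] += 1;
        return true;
    }
}
```

If the server already returns rounded coordinates with a count, the loop would add that count to the cell instead of 1.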

More suggestions

Make your latitude and longitude columns into FLOAT values in your table if they have GPS precision. If they have more precision than GPS, use DOUBLE. Storing and transferring numbers in varchar(30) columns is quite inefficient.

Similarly, make your HourOfDay and DayOfWeek columns into SMALLINT or even TINYINT data types in your table. Using 64-bit integers for values between 0 and 31 is wasteful. With hundreds of rows, it doesn't matter. With millions it does.

Finally, if your queries always look like this

SELECT Latitude, Longitude
   FROM 3dag
  WHERE timestamp BETWEEN SOME_VALUE 
                      AND ANOTHER_VALUE
    AND HourOfDay = SOME_CONSTANT_HOUR
    AND DayOfWeek = SOME_CONSTANT_DAY

this compound covering index should be ideal to accelerate your query.

CREATE INDEX 3dag_hdtll ON 3dag (HourOfDay, DayOfWeek, `timestamp`, Latitude, Longitude)
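Why put the equality columns before the range column? Here is a toy model, not MySQL itself: a sorted set stands in for the index, each key is {hourOfDay, dayOfWeek, timestamp}, and the data is one made-up entry per hour for one week. IndexOrderDemo and its methods are invented for this illustration. It counts how many index entries a single contiguous range scan has to touch under each column order.

```java
import java.util.Comparator;
import java.util.NavigableSet;
import java.util.TreeSet;

final class IndexOrderDemo {
    // Compares keys column by column, in the given column order.
    static Comparator<long[]> by(int first, int second, int third) {
        return Comparator.<long[]>comparingLong(k -> k[first])
                .thenComparingLong(k -> k[second])
                .thenComparingLong(k -> k[third]);
    }

    // Synthetic data: timestamps 0..167 are the hours of one week.
    static NavigableSet<long[]> build(Comparator<long[]> order) {
        NavigableSet<long[]> index = new TreeSet<>(order);
        for (long ts = 0; ts < 7 * 24; ts++) {
            index.add(new long[] { ts % 24, ts / 24, ts });
        }
        return index;
    }

    /** Entries touched with column order (hour, day, timestamp). */
    static int scannedEqualityFirst(long hour, long day, long lo, long hi) {
        return build(by(0, 1, 2))
                .subSet(new long[] { hour, day, lo }, true,
                        new long[] { hour, day, hi }, true)
                .size();
    }

    /** Entries touched with column order (timestamp, hour, day). */
    static int scannedRangeFirst(long lo, long hi) {
        long min = Long.MIN_VALUE, max = Long.MAX_VALUE;
        // Only the timestamp bounds narrow the scan; hour and day must be
        // checked on every entry inside the timestamp range.
        return build(by(2, 0, 1))
                .subSet(new long[] { min, min, lo }, true,
                        new long[] { max, max, hi }, true)
                .size();
    }
}
```

On this toy data, scannedEqualityFirst(4, 3, 0, 167) touches 1 entry while scannedRangeFirst(0, 167) touches all 168. A real B-tree index has the same shape of problem: once the leading column is a range, the later columns no longer narrow the scan.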

Answer 3

I am extrapolating from my tracking app. This is what I do for efficiency:

Firstly, a possible solution depends on whether or not you can predict/control the time intervals. Store snapshots every X minutes or once a day, for example. Let us say you want to display all events YESTERDAY. You can save a snapshot that has already filtered your file. This would speed things up enormously, but is not a viable solution for custom time intervals and real live coverage.

My application is LIVE, but usually works pretty well at T+5 minutes (5 minute maximum lag/delay). Only when the user actually chooses live position viewing will the application open a full query on the live db. So it depends on how your app works.

Second factor: how you store your timestamp is very important. Avoid VARCHAR, for example. If you are converting from unixtime in the query, that will also add unnecessary lag. Since you are developing what appears to be a geotracking application, your timestamp would be in unixtime, an integer. Some devices work with milliseconds; I would recommend not using them: 1449878400 instead of 1449878400000 (12/12/2015 0 GMT).

I save all my geopoint datetimes in unixtime seconds and use MySQL timestamps only for recording the moment the point was received by the server (which is irrelevant to the query you propose).
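If your devices do report milliseconds, the conversion is a one-liner on the Java side before inserting. A small sketch, with EpochUnits as a made-up helper name:

```java
import java.time.Instant;

final class EpochUnits {
    /** Converts epoch milliseconds to whole epoch seconds. */
    static long millisToSeconds(long epochMillis) {
        // floorDiv also rounds down for negative (pre-1970) values.
        return Math.floorDiv(epochMillis, 1000L);
    }

    /** Renders epoch seconds as an ISO-8601 UTC instant, handy for sanity checks. */
    static String describe(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds).toString();
    }
}
```

For the value above, millisToSeconds(1449878400000L) gives 1449878400, and describe(1449878400L) gives 2015-12-12T00:00:00Z.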

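You might shave some time off by reading from a precomputed view of the data instead of running a full query. MySQL has no indexed (materialized) views, so in practice this means a summary table that you maintain yourself. Whether the time saved is significant for a large query is subject to testing.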

Finally, you could shave off a tiny bit more by not using BETWEEN and writing out something similar to what it translates into (pseudocode below):

WHERE (timecode > start_Time AND timecode < end_time)

Note that I changed >= and <= to > and < because your timestamp will almost never fall on the exact boundary second, and even if it does, it will rarely matter whether one geopoint/time event is displayed or not.

solution1
1 2015-12-13 13:32:54

solution2
1 2015-12-13 14:18:55

solution3
0 2015-12-13 14:33:18
