
What is the complexity of an SQL query that randomly selects a subset of rows from a database?

Introduction

I am using the following SQL query on an SQLite3 database. I want to randomly select L rows whose id is greater than or equal to a randomly generated number in [1, ..., max(id)]. The table contains 40 million rows, so max(id) = 40M.


SQL query

SELECT distinct tf_idf
       FROM MY_TABLE 
       WHERE id >= (abs(random()) % (SELECT max(id) FROM MY_TABLE)) 
       LIMIT L;

Complexity

  • The complexity of random() is O(1).
  • The complexity of (SELECT max(id) FROM MY_TABLE) is O(N).
  • I still can't work out the complexity of distinct tf_idf.

SQL does not provide complexity guarantees. The best we can do is talk about the lower bound of what's theoretically possible, and keep in mind that other factors may dominate.

You write that the complexity of (SELECT max(id) FROM MY_TABLE) is O(N). It could equally be O(log N), depending on whether id is indexed and whether the index is actually used, or possibly O(1) if max(id) is treated specially.
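One way to see which case applies is to ask SQLite for its query plan. The following is a minimal sketch using Python's built-in sqlite3 module. The table layout, the index name idx_my_table_id and the 1,000 sample rows are assumptions made for illustration, and the exact plan wording differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for MY_TABLE. id is deliberately NOT declared as a
# primary key, so the effect of adding an index afterwards is visible.
conn.execute("CREATE TABLE MY_TABLE (id INTEGER, tf_idf REAL)")
conn.executemany(
    "INSERT INTO MY_TABLE VALUES (?, ?)",
    ((i, (i % 100) / 100.0) for i in range(1, 1001)),
)


def plan(sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT max(id) FROM MY_TABLE"

# Without an index, the only way to find the maximum is to visit the rows.
plan_without_index = plan(query)

# With an index on id, the maximum sits at one end of a B-tree.
conn.execute("CREATE INDEX idx_my_table_id ON MY_TABLE (id)")
plan_with_index = plan(query)

print("no index:  ", plan_without_index)
print("with index:", plan_with_index)
```

If the second plan names the index, the engine can reach the maximum by walking down one side of the B-tree instead of visiting every row. The plan only tells you which strategy was chosen; it is not a complexity guarantee.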

The complexity of distinct is likewise opaque. It implies a sort, which we can take to be O(N log N). But it's only O(N) if the data are already sorted, and cheaper still if they're known not to contain duplicates.
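The same plan inspection works for distinct. This is a sketch under the same assumptions as before: an in-memory SQLite database, a made-up index name idx_my_table_tf_idf, and sample data with 50 distinct values. On the SQLite builds I have seen, the unindexed plan mentions a temporary B-tree used to weed out duplicates, but treat the exact wording as version-dependent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for MY_TABLE with 1,000 rows and 50 distinct tf_idf values.
conn.execute("CREATE TABLE MY_TABLE (id INTEGER PRIMARY KEY, tf_idf REAL)")
conn.executemany(
    "INSERT INTO MY_TABLE (tf_idf) VALUES (?)",
    (((i % 50) / 50.0,) for i in range(1000)),
)


def plan(sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT DISTINCT tf_idf FROM MY_TABLE"

# No index on tf_idf: the engine needs its own structure to detect duplicates.
plan_unsorted = plan(query)

# Index on tf_idf: the values can be read in sorted order, so duplicates
# are adjacent and can be skipped while scanning.
conn.execute("CREATE INDEX idx_my_table_tf_idf ON MY_TABLE (tf_idf)")
plan_sorted = plan(query)

print("no index:  ", plan_unsorted)
print("with index:", plan_sorted)
```

The first plan corresponds to the "implies a sort" case, the second to the "already sorted" case.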

Looking at your query, I would approach your question this way:

  • a binary search along an index on id, if one exists
  • a binary search along a (putative) index on the output column tf_idf
  • N times, where N is a function of the cardinality of id and tf_idf

For example, suppose there is only 1 id and L is 2. If the cardinality of id to tf_idf is 1:1, then with or without an index on id the system will have to read all the rows in MY_TABLE. If every id is unique, but they all map to the same tf_idf, an index would probably only add to the cost versus a linear scan. If the cardinality is 1:1 and id is unique, then N ~ L: as the number of distinct pairs grows, the probability of randomly selecting a duplicate declines.
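To make these cases concrete, here is a toy model in Python. It is not how SQLite executes the query; it only counts how many rows would have to be read after a binary search on id. The function name rows_read and the sample data are made up for illustration.

```python
from bisect import bisect_left


def rows_read(rows, start_id, limit):
    """Count rows visited to collect `limit` distinct tf_idf values.

    rows is a list of (id, tf_idf) tuples sorted by id, standing in for
    a table read through an index on id.
    Returns (distinct values found, number of rows read).
    """
    # Binary search for the first row with id >= start_id: O(log N).
    # The 1-tuple (start_id,) sorts before any (start_id, value) pair.
    pos = bisect_left(rows, (start_id,))

    found = []
    seen = set()
    n = 0
    for i in range(pos, len(rows)):
        if len(found) >= limit:
            break  # enough distinct values, stop reading
        n += 1
        value = rows[i][1]
        if value not in seen:
            seen.add(value)
            found.append(value)
    return found, n


# Unique ids, 1:1 with tf_idf: about L rows are read (N ~ L).
unique = [(i, float(i)) for i in range(1, 1001)]
print(rows_read(unique, 1, 10))

# Unique ids that all map to the same tf_idf: every row from the start
# position onward is read, and the limit is never reached.
same = [(i, 0.5) for i in range(1, 1001)]
print(rows_read(same, 1, 2))

# A single id with L = 2: the whole (one-row) table is read.
print(rows_read([(1, 0.5)], 1, 2))
```

The gap between the first two cases is the point: the same query and the same L can cost anywhere from L row reads to a full pass over the table, depending on how id and tf_idf are distributed.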

