I am using the following SQL query on an SQLite3 database. I want to randomly select N rows that have an id greater than or equal to a randomly generated number in [1, ..., max(id)]. The table contains 40 million rows, thus max(id) = 40M.
SELECT distinct tf_idf
FROM MY_TABLE
WHERE id >= (abs(random()) % (SELECT max(id) FROM MY_TABLE))
LIMIT L;
O(1).
(SELECT max(id) FROM MY_TABLE) is O(N).
distinct tf_idf
SQL does not provide complexity guarantees. The best we can do is talk about the lower bound of what's theoretically possible, and keep in mind that other factors may dominate.
The complexity of (SELECT max(id) FROM MY_TABLE) may be O(N), or O(log N), depending on your index and whether or not it's used. It could even be O(1), if max(id) is treated specially.
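One way to see which case applies is to ask SQLite for its plan rather than guess. The following is a minimal sketch using Python's standard sqlite3 module. The schema (id as INTEGER PRIMARY KEY, tf_idf as REAL), the 1000 sample rows, and the helper name query_plan are assumptions made for illustration; the real table may be declared differently, and the plan text varies between SQLite versions.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The detail text is the last column in both old and new SQLite versions.
    return [row[-1] for row in rows]


# Hypothetical stand-in for MY_TABLE; the real schema is not given in the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MY_TABLE (id INTEGER PRIMARY KEY, tf_idf REAL)")
conn.executemany(
    "INSERT INTO MY_TABLE (id, tf_idf) VALUES (?, ?)",
    [(i, (i % 7) / 7.0) for i in range(1, 1001)],
)

max_plan = query_plan(conn, "SELECT max(id) FROM MY_TABLE")
max_id = conn.execute("SELECT max(id) FROM MY_TABLE").fetchone()[0]

# A line starting with SEARCH suggests an index or rowid lookup;
# a line starting with SCAN suggests the whole table is read.
for line in max_plan:
    print(line)
```

Running the same helper against the real database, with its real indexes, shows whether max(id) is answered by a lookup or by a full scan.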
The complexity of distinct is likewise opaque. It implies a sort, which we can take to be O(N log N). But it's only O(N) if the data are already sorted, and cheaper still if they're known not to contain duplicates.
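The difference between the sorted and unsorted cases can be sketched in a few lines of Python. This is only a model of the cost argument, not a description of how SQLite implements DISTINCT; the function names are made up for illustration.

```python
from itertools import groupby


def distinct_by_sort(values):
    """General case: sort first (O(N log N)), then drop adjacent duplicates."""
    return [key for key, _ in groupby(sorted(values))]


def distinct_presorted(values):
    """Input already sorted: a single linear pass (O(N)), no sort needed."""
    return [key for key, _ in groupby(values)]


unsorted_result = distinct_by_sort([3, 1, 3, 2, 1])
presorted_result = distinct_presorted([1, 1, 2, 3, 3])
```

Both calls produce the same distinct values; the second simply skips the sort because equal values are already adjacent.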
Looking at your query, I would approach your question this way:
id
, if extant tf_idf
id
and tf_idf
For example, suppose there is only 1 id and L is 2. If the cardinality of id to tf_idf is 1:1, with or without an index on id, the system will have to read all the rows in MY_TABLE. If every id is unique, but they all map to the same tf_idf, an index would probably only add to the cost versus a linear scan. If the cardinality is 1:1 and id is unique, then N ~ L: as the number of distinct pairs grows, the probability of randomly selecting a duplicate declines.
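The two extremes can be illustrated with a small simulation. This sketch assumes a plain scan that stops as soon as L distinct tf_idf values have been seen; the function name rows_read_for_distinct and the sample data are hypothetical, and a real query planner may behave differently.

```python
def rows_read_for_distinct(tf_idf_values, limit):
    """Count rows read before `limit` distinct values have been seen.

    Models a scan that stops early once the LIMIT is satisfied.
    """
    seen = set()
    rows_read = 0
    for value in tf_idf_values:
        rows_read += 1
        seen.add(value)
        if len(seen) >= limit:
            break
    return rows_read


# Every row maps to the same tf_idf: LIMIT 2 can never be met,
# so the scan has to read all 1000 rows.
all_same = [0.5] * 1000
same_cost = rows_read_for_distinct(all_same, 2)

# Every row has its own tf_idf: the scan stops after 2 rows.
all_unique = [i / 1000 for i in range(1000)]
unique_cost = rows_read_for_distinct(all_unique, 2)
```

With these inputs the first case reads every row and the second reads only L rows, which is the N ~ L behaviour described above.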