How to check if record exists in database - fastest method

I have a table where I store unique text strings, and I check whether a given string already exists in the database with a SELECT:

String checkIfAlreadyScanned = "SELECT id FROM \"STRINGS_DB\"  where STR ='" + mystring + "'";

Then I check whether a value came back. My database has around 5 million records; can I improve on this method?

Maybe I could add a new attribute (hashedSTR, for example), convert each string into some unique numerical value, and then look up those numbers instead of the strings? Would that be faster? (Would it work at all?)

To ensure the fastest processing, make sure:

  • The field you are searching on is indexed. (You mentioned a "unique" string, so I suppose this is already the case, and "limit 1" is therefore not necessary. Otherwise, it should be added.)
  • You are using the ExecuteScalar() method of your Command object
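
As a rough sketch of the two points above, here is the same idea in Python, with the standard-library sqlite3 module standing in for the real database and driver. The table name strings_db, the helper find_id, and the sample values are assumptions made for illustration. A UNIQUE constraint gives the column an index, and reading only the first column of the first row plays the role of ExecuteScalar().

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
# UNIQUE makes the engine build an index on str, so lookups avoid a full scan.
conn.execute("CREATE TABLE strings_db (id INTEGER PRIMARY KEY, str TEXT NOT NULL UNIQUE)")
conn.executemany("INSERT INTO strings_db (str) VALUES (?)", [("alpha",), ("beta",)])

def find_id(conn, value):
    """Return the id of the matching row, or None (hypothetical helper)."""
    row = conn.execute("SELECT id FROM strings_db WHERE str = ?", (value,)).fetchone()
    return row[0] if row is not None else None

# The last column of each EXPLAIN QUERY PLAN row describes the access path.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM strings_db WHERE str = ?", ("alpha",)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
```

Printing plan_text is a quick way to confirm that the lookup goes through the index rather than scanning the table; PostgreSQL offers EXPLAIN for the same purpose.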

Testing makes no sense; just include the "test" in the WHERE clause:

INSERT INTO silly_table(the_text)
SELECT 'literal_text'
WHERE NOT EXISTS (
    SELECT *
    FROM silly_table
    WHERE the_text = 'literal_text'
    );

Now the test is made only when it is needed: by the end of the statement, the row will exist. There is no such thing as "try".

For those who don't understand why testing makes no sense: a test would only make sense if the situation could not change after the test. That would need a test-and-lock scenario or, even worse, a test inside a transaction.
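
To illustrate the single-statement approach, here is a small sketch in Python using the standard-library sqlite3 module as a stand-in for the real database. The helper name insert_if_absent is hypothetical, and the value is bound as a parameter instead of being written as a literal. Running it twice with the same text should leave exactly one row.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE silly_table (id INTEGER PRIMARY KEY, the_text TEXT)")

def insert_if_absent(conn, text):
    """Insert text unless an equal row exists; return rows inserted (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO silly_table (the_text) "
        "SELECT ? WHERE NOT EXISTS (SELECT 1 FROM silly_table WHERE the_text = ?)",
        (text, text),
    )
    return cur.rowcount

first = insert_if_absent(conn, "literal_text")   # the row is new, so it is inserted
second = insert_if_absent(conn, "literal_text")  # the row already exists, nothing happens
total = conn.execute("SELECT COUNT(*) FROM silly_table").fetchone()[0]
```

The caller never runs a separate existence check; the number of inserted rows tells it afterwards whether the string was new.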

UPDATE: version that works (basically the same):

DROP TABLE exitsnot CASCADE;
CREATE TABLE exitsnot
        ( id SERIAL NOT NULL PRIMARY KEY
        , val INTEGER -- REFERENCES something
        , str varchar -- REFERENCES something
        );

INSERT INTO exitsnot (val)
SELECT 42
WHERE NOT EXISTS (
        SELECT * FROM exitsnot
        WHERE val = 42
        );
INSERT INTO exitsnot (str)
SELECT 'silly text'
WHERE NOT EXISTS (
        SELECT * FROM exitsnot
        WHERE str = 'silly text'
        );
SELECT version();

Output:

DROP TABLE
NOTICE:  CREATE TABLE will create implicit sequence "exitsnot_id_seq" for serial column "exitsnot.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "exitsnot_pkey" for table "exitsnot"
CREATE TABLE
INSERT 0 1
INSERT 0 1
                                           version                                            
----------------------------------------------------------------------------------------------
 PostgreSQL 9.1.2 on i686-pc-linux-gnu, compiled by gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 32-bit
(1 row)

String checkIfAlreadyScanned = "SELECT 1 FROM \"STRINGS_DB\"  where STR ='" + mystring + "'";

If your result set contains a row, then the record exists.

Limit the result set to 1:

String checkIfAlreadyScanned = @"
    SELECT id 
    FROM ""STRINGS_DB""  
    where STR ='" + mystring + @"'
    limit 1";

This, an index on that column, and the @Laurent suggestion for ExecuteScalar() will yield the best result.

Also, if there is any chance that mystring has been touched by the user, parameterize the query to avoid SQL injection.
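
A minimal sketch of what parameterizing looks like, written in Python with the standard-library sqlite3 module as a stand-in for the real driver. The table name strings_db, the helper already_scanned, and the sample payload are assumptions for illustration. The value is passed separately from the SQL text, so quotes inside it cannot change the query.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings_db (id INTEGER PRIMARY KEY, str TEXT UNIQUE)")
conn.execute("INSERT INTO strings_db (str) VALUES (?)", ("hello",))

def already_scanned(conn, mystring):
    """True if mystring is stored; the value is bound, never spliced into the SQL."""
    row = conn.execute(
        "SELECT id FROM strings_db WHERE str = ? LIMIT 1", (mystring,)
    ).fetchone()
    return row is not None

# A classic injection payload is compared as plain text and matches nothing.
hostile = "x' OR '1'='1"
```

With string concatenation, the hostile value above would turn the WHERE clause into a condition that is always true; as a bound parameter it is just an unusual string that is not in the table.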

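A cleaner version (note that this is still plain string substitution, not a parameterized query, so it does not protect against injection):

String checkIfAlreadyScanned = @"
    SELECT id 
    FROM ""STRINGS_DB""  
    where STR = '@mystring'
    limit 1
    ".Replace("@mystring", mystring);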

How long are these text strings? If they are very long, you might get a performance improvement by storing a hash of the strings (along with the original strings).

CREATE TABLE strings_db (
    id       INT PRIMARY KEY,
    text     TEXT,
    hash     TEXT
);

Your hash column could store MD5 sums, CRC32s, or the output of any other hash algorithm you choose. And it should be indexed.

Then modify your query to something like:

SELECT id FROM strings_db WHERE hash=calculate_hash(?)

If the average size of your text fields is sufficiently larger than the size of your hashes, doing the search on the shorter field will help with disk I/O. This also means additional CPU overhead when inserting and selecting, to calculate the hash, and additional disk space to store the hash. So all of these factors must be taken into consideration.
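
The idea can be sketched as follows, in Python with the standard-library sqlite3 and hashlib modules standing in for the real database and for calculate_hash. The helper names md5_hex, add_string and find_id, the index name, and the sample text are assumptions for illustration. The sketch also compares the original text after the hash match, since two different strings can share a hash.

```python
import hashlib
import sqlite3

def md5_hex(text):
    """Hex MD5 digest of the UTF-8 text (a lookup key here, not a security measure)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# In-memory SQLite stands in for the real database (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings_db (id INTEGER PRIMARY KEY, text TEXT, hash TEXT)")
conn.execute("CREATE INDEX strings_db_hash_idx ON strings_db (hash)")

def add_string(conn, text):
    """Store the text together with its hash (hypothetical helper)."""
    conn.execute(
        "INSERT INTO strings_db (text, hash) VALUES (?, ?)", (text, md5_hex(text))
    )

def find_id(conn, text):
    """Search on the short indexed hash, then confirm on the full text to rule out collisions."""
    row = conn.execute(
        "SELECT id FROM strings_db WHERE hash = ? AND text = ?", (md5_hex(text), text)
    ).fetchone()
    return row[0] if row is not None else None

long_text = "lorem ipsum " * 1000
add_string(conn, long_text)
```

Here the index holds 32-character digests instead of 12,000-character strings, which is where the I/O saving would come from; whether it outweighs the hashing cost depends on the data.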

PS Always use prepared statements to avoid SQL injection attacks!

Actually, there is just such a thing as you ask for, but it has some limitations. PostgreSQL supports a hash index type:

CREATE INDEX strings_hash_idx ON "STRINGS_DB" USING hash (str);

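It works for simple equality searches with =, just like the one you have. I quote the manual on the limitations (this caveat applies to PostgreSQL versions before 10; from version 10 on, hash indexes are WAL-logged and crash-safe):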

Hash index operations are not presently WAL-logged, so hash indexes might need to be rebuilt with REINDEX after a database crash. They are also not replicated over streaming or file-based replication. For these reasons, hash index use is presently discouraged.


A quick test on a real life table, 433k rows, 59 MB total:

SELECT * FROM tbl WHERE email = 'some.user@some.domain.com'
-- No index, sequential scan: Total runtime: 188 ms  
-- B-tree index (default):   Total runtime:   0.046 ms  
-- Hash index:               Total runtime:   0.032 ms  

That's not a huge gain, but it is something. The difference will be more substantial with strings longer than the email address in my test. Index creation took 1 or 2 seconds with either index.

[Edit] Limit the results so that only the first record that meets the criteria is returned. For SQL Server: select TOP 1 ...; for MySQL/PostgreSQL: select ... LIMIT 1;

If there can be multiples, perhaps adding a "TOP 1" to your select statement could return faster.

String checkIfAlreadyScanned = "SELECT TOP 1 id FROM \"STRINGS_DB\"  where STR ='" + mystring + "'";

That way, it only has to find the first instance of the string.

But, if you don't have multiples, you'll not likely see much benefit with this approach.

Like others have said, putting an index on it may help.

Assuming you don't actually need the id column, I think this gives the query optimizer the best chance to optimize:

select 1
where exists(
    select 1 
    from STRINGS_DB
    where STR = 'MyString'
)
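
A closely related variant, sketched in Python with the standard-library sqlite3 module as a stand-in for the real database, selects the EXISTS expression itself. This is not the exact query above: it always returns one row holding 1 or 0, which is convenient to read as a boolean. The table contents and the helper name exists are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings_db (id INTEGER PRIMARY KEY, str TEXT UNIQUE)")
conn.execute("INSERT INTO strings_db (str) VALUES (?)", ("MyString",))

def exists(conn, value):
    """EXISTS yields exactly one row holding 1 or 0, so no id has to be fetched."""
    (found,) = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM strings_db WHERE str = ?)", (value,)
    ).fetchone()
    return bool(found)
```

The subquery can stop at the first matching row, so no LIMIT or TOP clause is needed.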

While all the answers here have their merit, I wish to mention another aspect.

Building your query this way, by concatenating a string into it, does not help the database engine optimize the query. Instead, you should write a stored procedure and call it with a single parameter, so the database engine can build a query plan once and reuse it.

Of course, the field should be indexed.
