
What is the best way to index email addresses in MySQL?

I have a signup table with millions of email ID records in it. Email IDs are unique. What is the best way to index them and fetch them back from ASP.NET for authentication purposes? That is, should I define the email ID column as a clustered unique index rather than a plain UNIQUE index?

When you have variable-length textual input, such as e-mails or addresses, and you want the values to be unique, the standard approach is to index a hash of the value.

Reason: hashes are fixed-length, so you avoid problems with text data exceeding the maximum index key length.
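To illustrate the fixed-length point, here is a minimal Python sketch (Python is used only for illustration; the addresses and the `email_hash` helper are hypothetical). It hashes a short and a long address with SHA-1 and compares the digest sizes.

```python
import hashlib

def email_hash(email: str) -> bytes:
    """Return the raw SHA-1 digest of an email address (hypothetical helper)."""
    # Assumption: the address is hashed as UTF-8 bytes.
    return hashlib.sha1(email.encode("utf-8")).digest()

short = email_hash("a@b.co")
long = email_hash("a.very.long.local.part+with.tags@subdomain.example.com")

# Both digests have the same size no matter how long the input was.
print(len(short), len(long))
```

Whatever the length of the address, the digest is 20 bytes, so the indexed column can have a fixed, small width.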

According to your comment, the table you have would look like this (I purposely omitted password and mobile number):

create table users (
    user_id int unsigned not null auto_increment,
    first_name varchar(255) not null,
    surname varchar(255) default null,
    email varchar(255) not null,
    primary key(user_id)
) engine = innodb;

I would alter that table and add a field that contains the email hash. I'd maintain this hash via a trigger, so that you can focus on getting valid data in without worrying about creating hashes. The field would be binary(20), since it will contain a raw SHA-1 hash and that takes 20 bytes. Since we want to maintain it via a trigger, we need to make that field nullable and unique. Note: you can make it binary(40) if you prefer to store the hash as a hex string instead of raw bytes.

Table:

create table users (
    user_id int unsigned not null auto_increment,
    email_hash binary(20) default null, -- this is the field in question
    first_name varchar(255) not null,
    surname varchar(255) default null,
    email varchar(255) not null,
    primary key(user_id),
    unique(email_hash) -- this is the unique index over the hash
) engine = innodb;

What we need now is a trigger that maintains the email hash. I'll show how to create a trigger that sets it before each insert. Similar logic applies to updates, using a BEFORE UPDATE trigger with the same body:

DELIMITER $$

CREATE TRIGGER users_before_insert BEFORE INSERT ON `users`
FOR EACH ROW BEGIN
    -- Store the raw 20-byte SHA-1 digest of the email.
    -- Remove UNHEX if you want a human-readable value; you'll need binary(40) to hold it then.
    SET NEW.email_hash = UNHEX(SHA1(NEW.email));
END$$

DELIMITER ;

From within your application, you simply provide values for first name, surname and email. MySQL will reject duplicates and signal this with SQLSTATE 23000 (integrity constraint violation). I don't know ASP.NET, so you'll have to map that onto its error handling.
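As a rough illustration of handling the duplicate signal in application code, the sketch below uses Python with an in-memory SQLite database as a stand-in, so that it runs without a MySQL server. That swap is an assumption of the sketch: SQLite has no SHA1() function, so the hash is computed in the application rather than by a trigger, and a duplicate raises `sqlite3.IntegrityError` instead of SQLSTATE 23000. The `register` function is hypothetical.

```python
import hashlib
import sqlite3

# In-memory SQLite stands in for the MySQL table; only the relevant columns are kept.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    " user_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " email_hash BLOB UNIQUE,"
    " email TEXT NOT NULL)"
)

def register(email: str) -> bool:
    """Insert a user; return False if the email hash already exists."""
    digest = hashlib.sha1(email.encode("utf-8")).digest()
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (email_hash, email) VALUES (?, ?)",
                (digest, email),
            )
        return True
    except sqlite3.IntegrityError:
        # Unique constraint on email_hash was violated: duplicate email.
        return False

first = register("alice@example.com")
second = register("alice@example.com")
print(first, second)
```

With a MySQL driver the shape would be the same: attempt the insert, catch the driver's integrity or duplicate-key error, and report that the address is already registered.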

You can compute the hashes within your ASP.NET application, but if you are more comfortable having the database do it, the trigger above shows how.
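For the application-side route, a minimal sketch in Python (not ASP.NET; the function names are hypothetical) of the two representations discussed above: the raw digest that fits binary(20) and the hex string that needs 40 characters. It assumes the address is hashed as UTF-8 bytes; whether that matches what MySQL's SHA1() sees depends on the column and connection character set.

```python
import hashlib

def sha1_raw(email: str) -> bytes:
    """Raw digest, the kind of value UNHEX(SHA1(email)) stores in binary(20)."""
    return hashlib.sha1(email.encode("utf-8")).digest()

def sha1_hex(email: str) -> str:
    """Hex digest, the kind of value SHA1(email) returns without UNHEX."""
    return hashlib.sha1(email.encode("utf-8")).hexdigest()

email = "alice@example.com"  # hypothetical address
raw = sha1_raw(email)
hexed = sha1_hex(email)

print(len(raw), len(hexed))
# Decoding the hex string gives back the raw bytes, which is what UNHEX does.
print(bytes.fromhex(hexed) == raw)
```

If the application computes the hash, it would pass `raw` (or `hexed`) as the email_hash parameter of the INSERT, and the trigger would no longer be needed.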

The same rule applies to the mobile number, or any other field, if you require it to be unique. Naturally, the hash of a mobile number may be longer than the number itself, in which case you might simply put the unique index directly on the mobile number column.

I hope this helps a bit in your decision on what to do.

Too many things for a comment...

If you already have INDEX(email), then simply turn it into UNIQUE(email). The table (data+index) size will not change (more than a little due to the ALTER).

If email is too big to index (for example, because it is TEXT), then there is no way to add a UNIQUE index on email. In this case, the "hash" solution would work. Yes, it would add megabytes to the disk usage, but this is unlikely to be an issue.

If you currently have id AUTO_INCREMENT and PRIMARY KEY(id), then do you actually use id in other tables? If not, then there are other paths we can discuss, such as making email or hash the PRIMARY KEY. This might even shrink the disk footprint.

Regardless of what you do, use InnoDB.

If you're doing a unique key lookup, whether the index is clustered or not doesn't make enough of a performance difference to worry about. It might make sense (or not) to cluster it as you add more things to the table. The main thing is that you have a unique constraint; most likely this will be the primary key, so you'll get the constraint and a corresponding index. Performance will be fine, so concern yourself with the other uses. For example, if you want to do an analysis by domain, you might need to decompose the email address. That might be more important. Like most things, it depends...
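As a small illustration of decomposing an address for domain analysis, here is a Python sketch. The `split_email` helper and the sample addresses are hypothetical, and lowercasing the domain is an assumption made so that differently cased domains are counted together.

```python
from collections import Counter

def split_email(email: str) -> tuple[str, str]:
    """Split an address into (local part, lowercased domain) at the last '@'."""
    local, _, domain = email.rpartition("@")
    return local, domain.lower()

emails = ["alice@example.com", "bob@Example.com", "carol@example.org"]

# Count signups per domain.
domain_counts = Counter(split_email(e)[1] for e in emails)
print(domain_counts.most_common())
```

If this kind of query matters, storing the domain in its own indexed column would avoid scanning and splitting every address at query time.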

Hashing an email address column for indexing can be achieved by altering the table to add a new field (email_hash):

ALTER TABLE user_meta ADD email_hash VARBINARY(32) NULL;

Then set the value of the email_hash by:

UPDATE user_meta SET email_hash = MD5(email);
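A quick check of why 32 bytes is the right width here, as a hedged Python sketch (the address is hypothetical): MD5 produces a 16-byte digest, and MySQL's MD5() returns it as a string of 32 hex digits, which is what the UPDATE above stores.

```python
import hashlib

email = "alice@example.com"  # hypothetical address
md5 = hashlib.md5(email.encode("utf-8"))

hex_form = md5.hexdigest()  # the form MD5(email) returns in MySQL
raw_form = md5.digest()     # the form UNHEX(MD5(email)) would give

print(len(hex_form), len(raw_form))
```

Storing the raw form instead would halve the column to 16 bytes, at the cost of values that are not human-readable.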

And then create a trigger as described above, for example:

DELIMITER $$
CREATE TRIGGER users_meta_before_insert BEFORE INSERT ON `user_meta`
FOR EACH ROW BEGIN
    -- MD5() returns 32 hex characters, which fits VARBINARY(32)
    SET NEW.email_hash = MD5(NEW.email);
END$$
DELIMITER ;

Also you may find this useful: https://www.koder.ly/2020/07/hashing-an-email-address/

