SQL Duplicates - Not finding all of them

I've a problem which is annoying the hell out of me!

I have a database with several thousand users. The data originally came from a database whose data I cannot trust, so I have imported it into another 'clean-up' database to remove duplicate entries.

I performed the query:

SELECT uid, username 
FROM users
GROUP BY username 
HAVING COUNT(username)>1

This is a sample of my table in its present state:

uid     forename     surname     username
1       Jo           Bloggs      jobloggs
2       Jo           Bloggs      jobloggs
3       Jane         Doe         janedoe
4       Jane         Doe         janedoe

After performing the query above, I get the following sample result:

uid     forename     surname     username
2       Jo           Bloggs      jobloggs

As you can see, there are 2 duplicated users, but the query is only displaying one of them.

When I perform the query, I get ~300 results. Obviously, if the query isn't pulling all the duplicates, I can't trust this result set to be accurate and can't proceed with the clean-up.

Any ideas about what I can try?

Thanks

Phil

There's no good explanation for the result set that is being returned.

According to the sample data and your query, you should be getting a second row:

3   janedoe

(Actually, it's arbitrary whether you get a uid value of 3 or 4 returned.)
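
To take the arbitrariness out of it, every selected column can be either grouped or aggregated. This is only a sketch, assuming the users table from the question; the column aliases are made up for illustration. It also runs when ONLY_FULL_GROUP_BY is enabled (the default since MySQL 5.7.5), a mode in which the original query would be rejected.

```sql
-- One row per duplicated username, with deterministic uid values.
SELECT username,
       COUNT(*)                       AS copies,    -- how many rows share this username
       MIN(uid)                       AS first_uid, -- lowest uid in the group
       GROUP_CONCAT(uid ORDER BY uid) AS all_uids   -- every uid in the group, e.g. '3,4'
  FROM users
 GROUP BY username
HAVING COUNT(*) > 1
 ORDER BY username;
```

If 'janedoe' is missing from this output as well, the two values are not equal as far as MySQL is concerned.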

Also, be sure that your client isn't returning just a subset of rows; e.g. SQLyog has a "Limit rows" feature which limits the number of rows returned.
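
One way to rule out a client-side row limit is to let the server count the duplicate groups and compare that number with the number of rows the client displays. A minimal sketch, assuming the users table from the question (the alias names are arbitrary):

```sql
-- Count the duplicated usernames on the server, so a client row limit
-- cannot hide any of them.
SELECT COUNT(*) AS duplicate_usernames
  FROM ( SELECT username
           FROM users
          GROUP BY username
         HAVING COUNT(*) > 1
       ) AS dup;
```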

If that's not the issue, then the most likely explanation is that one of the 'janedoe' values includes non-printable characters, or you've got some wicked character set conversions going on where two different encodings are displaying as the same value.

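As a quick first step, I'd suggest you check the length of each of those 'janedoe' values. Note that LENGTH() counts bytes and CHAR_LENGTH() counts characters, so a mismatch between the two points at multi-byte characters:

SELECT uid, username, LENGTH(username) AS byte_len, CHAR_LENGTH(username) AS char_len FROM users WHERE uid IN (3,4) ORDER BY uid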
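
The same kind of check can be run over the whole table to flag suspicious usernames. This is a rough sketch, assuming the users table from the question; the REGEXP character classes are my own guess at what counts as suspicious, so adjust them to your data.

```sql
-- List usernames that contain multi-byte, whitespace or control characters.
SELECT uid,
       username,
       LENGTH(username)      AS byte_len,  -- length in bytes
       CHAR_LENGTH(username) AS char_len,  -- length in characters
       HEX(username)         AS hex_bytes  -- raw bytes, for eyeballing
  FROM users
 WHERE LENGTH(username) <> CHAR_LENGTH(username)  -- multi-byte characters present
    OR username REGEXP '[[:space:][:cntrl:]]'     -- whitespace or control characters
 ORDER BY username, uid;
```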

Also, you could try displaying the actual encodings, using the HEX() function, to see if there's a difference. (NOTE: It's not clear to me whether a character set translation occurs before or after the HEX. What we're after here is a MySQL equivalent of the Oracle DUMP() function, which displays a byte-by-byte representation of the actual value.)

It's possible that you've got some Latin1 encodings mangled into UTF-8, or vice versa, or some other character set weirdness going on. This may give you some ideas...

SELECT username
     , HEX(username)
     , HEX(BINARY username)
     , CONVERT(BINARY username USING latin1) 
     , CONVERT(BINARY username USING utf8)
  FROM users 
 WHERE uid IN (3,4)
 ORDER BY uid
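
If the hex output does show stray bytes, grouping on a normalized key may surface the duplicates that a plain GROUP BY username misses. This is only a sketch, assuming the users table from the question. The normalization here (TRIM plus LOWER) is a hypothetical choice that only handles surrounding spaces and letter case; other stray characters would need their own REPLACE() calls.

```sql
-- Group on a cleaned-up username and show how many distinct raw byte
-- sequences ended up in each group.
SELECT LOWER(TRIM(username))          AS norm_username,
       COUNT(*)                       AS copies,
       COUNT(DISTINCT HEX(username))  AS distinct_encodings, -- > 1 means the raw values differ
       GROUP_CONCAT(uid ORDER BY uid) AS uids
  FROM users
 GROUP BY norm_username
HAVING COUNT(*) > 1
 ORDER BY norm_username;
```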
