SQL Duplicates - Not finding all of them

I have a problem that is annoying the hell out of me!

I have a database with several thousand users. The data originally came from a database whose data I cannot trust, so I have imported it into a separate 'clean-up' database to remove duplicate entries.

I performed the query:

SELECT uid, username 
FROM users
GROUP BY username 
HAVING COUNT(username)>1

This is a sample of my table in its present state:

uid     forename     surname     username
1       Jo           Bloggs      jobloggs
2       Jo           Bloggs      jobloggs
3       Jane         Doe         janedoe
4       Jane         Doe         janedoe

After performing the query above, I get the following sample result:

uid     forename     surname     username
2       Jo           Bloggs      jobloggs

As you can see, there are two duplicated users, but the query is only displaying one of them.

When I run the query, I get about 300 results. Obviously, if the query isn't pulling all the duplicates, I can't trust this result set to be accurate and can't proceed with the clean-up.

Any ideas about what I can try?

Thanks

Phil

There's no good explanation for the result set that is being returned.

According to the sample data and your query, you should be getting a second row:

3   janedoe

(Actually, it's arbitrary whether you get a uid value of 3 or 4 returned, because uid is neither aggregated nor listed in the GROUP BY.)
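
To see every row involved in a duplicate, rather than one arbitrary representative per group, the grouped query can be joined back to the table. This is only a minimal sketch, assuming the `users` table and columns from the question; the aliases `u` and `d` are illustrative.

```sql
-- List every row whose username occurs more than once,
-- instead of one arbitrary row per username.
SELECT u.uid
     , u.username
  FROM users u
  JOIN ( SELECT username              -- usernames that appear 2+ times
           FROM users
          GROUP BY username
         HAVING COUNT(*) > 1
       ) d
    ON d.username = u.username
 ORDER BY u.username, u.uid;
```

With the sample data as shown, this should return uids 1 through 4. If uids 3 and 4 are missing, the two 'janedoe' values are not equal as far as the database is concerned.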

Also, be sure that your client isn't returning just a subset of rows; e.g. SQLyog has a "Limit rows" feature which limits the number of rows returned.

If that's not the issue, then the most likely explanation is that one of the 'janedoe' values includes non-printable characters, or you've got some wicked character set conversions going on, where two different encodings display as the same value.

As a quick first step, I'd suggest you check the number of characters in each of those 'janedoe' values:

SELECT username, LENGTH(username) FROM mytable WHERE uid IN (3,4) ORDER BY uid
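
Note that in MySQL, LENGTH() counts bytes while CHAR_LENGTH() counts characters, so showing both side by side can hint at multi-byte or hidden characters. A sketch, reusing the placeholder table name `mytable` from the query above; the column aliases are illustrative.

```sql
-- Compare byte length and character length for the suspect rows.
-- A difference between the two columns means multi-byte characters;
-- a difference between the two rows means the values are not identical.
SELECT uid
     , username
     , LENGTH(username)      AS byte_len   -- length in bytes
     , CHAR_LENGTH(username) AS char_len   -- length in characters
  FROM mytable
 WHERE uid IN (3,4)
 ORDER BY uid;
```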

Also, you could try displaying the actual encodings, using the HEX() function to see if there's a difference. (NOTE: It's not clear to me whether a character set translation occurs before or after the HEX. What we're after here is a MySQL equivalent of the Oracle DUMP() function, which will display a byte-by-byte representation of the actual value.)

It's possible that you've got some Latin1 encodings mangled into UTF-8, or vice versa, or some other character set weirdness going on. This may give you some ideas...

SELECT username
     , HEX(username)
     , HEX(BINARY username)
     , CONVERT(BINARY username USING latin1) 
     , CONVERT(BINARY username USING utf8)
  FROM mytable 
 WHERE uid IN (3,4)
 ORDER BY uid
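
To look beyond uid 3 and 4, a rough sketch like the following lists every username that contains a character outside the printable ASCII range. It assumes usernames are meant to be plain ASCII, which the question does not state, and it reuses the placeholder table name `mytable`; the alias `username_hex` is illustrative.

```sql
-- Flag usernames containing anything other than printable ASCII,
-- i.e. any character outside the range space (0x20) through tilde (0x7E).
SELECT uid
     , username
     , HEX(username) AS username_hex   -- byte values, to spot the odd character
  FROM mytable
 WHERE username REGEXP '[^ -~]'
 ORDER BY username, uid;
```

Trailing spaces are printable and will not be flagged by this pattern, so the length comparison is still worth running.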
