MySQL: How to identify duplicates based on four fields

I have read a few posts on SO about how to delete duplicates by comparing a table with another instance of itself. However, I don't want to delete the duplicates; I want to compare them.

E.g. I have the fields "id", "sold_price", "bruksareal", "kommunenr", "Gårdsnr", "Bruksnr", "Festenr", "Seksjonsnr". All fields are int.

I want to identify the rows that are duplicates/identical (the same bruksareal, kommunenr, gårdsnr, bruksnr, festenr and seksjonsnr). If they are identical, I want to give these rows a unique reference number.

I believe this will make it easier to identify the rows that I later want to compare on other fields (e.g. "sold_price", "sold_date", etc.).

I'm open to suggestions if you believe my approach is wrong...

Perform a join of the table to itself across all fields, then use an EXISTS query, such as:

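UPDATE Table1
SET reference = UUID()
WHERE EXISTS (
    SELECT 1
    FROM (
        -- The derived table works around MySQL error 1093: the table being
        -- updated cannot be read directly in a subquery of the same UPDATE.
        SELECT DISTINCT tb1.id
        FROM Table1 tb1
        INNER JOIN Table1 tb2
            ON  tb1.Field1 = tb2.Field1
            AND tb1.Field2 = tb2.Field2
            -- AND ... one equality per remaining field
        WHERE tb1.id != tb2.id
    ) AS dup
    WHERE dup.id = Table1.id
);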

Actually, you can simplify this with just a join:

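-- MySQL puts the join before SET; UPDATE ... FROM is not valid MySQL syntax.
UPDATE Table1 tb1
INNER JOIN Table1 tb2
    ON  tb1.Field1 = tb2.Field1
    AND tb1.Field2 = tb2.Field2
    -- AND ... one equality per remaining field
SET tb1.reference = UUID()
WHERE tb1.id != tb2.id;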
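
Note that UUID() is evaluated once per updated row, so an update like the ones above flags the duplicated rows but gives each of them a different value. If the goal is one shared reference per group of identical rows, something along these lines might work. It is only a sketch: Table1, Field1, Field2 and reference are the placeholder names used above, and it assumes that reference can hold an id and that the smallest id in each group is an acceptable reference value.

```sql
-- Sketch only: Table1, Field1, Field2 and reference are placeholder names.
-- Step 1: list each group of identical rows together with its size.
SELECT Field1, Field2, COUNT(*) AS copies, MIN(id) AS first_id
FROM Table1
GROUP BY Field1, Field2
HAVING COUNT(*) > 1;

-- Step 2: give every row in a duplicate group the same reference.
-- Using the smallest id of the group is an assumption; any stable value works.
UPDATE Table1 AS t
INNER JOIN (
    SELECT Field1, Field2, MIN(id) AS first_id
    FROM Table1
    GROUP BY Field1, Field2
    HAVING COUNT(*) > 1
) AS dup
    ON  t.Field1 <=> dup.Field1   -- <=> is MySQL's NULL-safe equality
    AND t.Field2 <=> dup.Field2
SET t.reference = dup.first_id;
```

GROUP BY places NULLs in the same group, which is why the join uses `<=>` rather than `=`; with plain `=`, groups that contain a NULL field would not be matched. Rows that have no duplicate keep their existing reference.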

Depending on where you want to do that, I would go for a hash implementation. For every insert, calculate the hash of the needed columns at insert time (perhaps in a trigger); after that you should be able to find out very easily which rows are duplicated. If you index that column, the queries should be pretty fast, but remember that it is still not an int column, so it will get a little slower over time.

After this you can do whatever you please with the duplicated records, without running very expensive queries on the database.
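
The hash idea could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: Table1, Field1 and Field2 are placeholders for the real table and columns, while row_hash, idx_row_hash and table1_set_row_hash are hypothetical names introduced here.

```sql
-- Hypothetical column and index that hold the hash of the compared fields.
ALTER TABLE Table1
    ADD COLUMN row_hash CHAR(32),
    ADD INDEX idx_row_hash (row_hash);

-- Compute the hash on every insert. COALESCE turns NULL into '' so that
-- a NULL field still occupies its position in the hashed string.
CREATE TRIGGER table1_set_row_hash
BEFORE INSERT ON Table1
FOR EACH ROW
SET NEW.row_hash = MD5(CONCAT_WS('|',
        COALESCE(NEW.Field1, ''),
        COALESCE(NEW.Field2, '')));

-- Duplicated rows are then the hashes that occur more than once.
SELECT row_hash, COUNT(*) AS copies
FROM Table1
GROUP BY row_hash
HAVING COUNT(*) > 1;
```

Rows that existed before the trigger was created would need a one-off UPDATE using the same expression, and a matching BEFORE UPDATE trigger would be needed if the compared fields can change after insert.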

Later edit: Make sure that you convert NULL values into some defined value, since some MySQL functions, such as MD5, simply return NULL if the operand is NULL. The same goes for CONCAT: if one operand is NULL, it returns NULL (this does not hold for CONCAT_WS, though).
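
The NULL behaviour described above can be checked with a few literal values; the values and the '|' separator here are made up for illustration only.

```sql
SELECT MD5(NULL);                       -- NULL
SELECT CONCAT('1', NULL, '2');          -- NULL
SELECT CONCAT_WS('|', '1', NULL, '2');  -- '1|2' (the NULL argument is skipped)
SELECT CONCAT_WS('|', '1', '2', NULL);  -- '1|2' (same string, different row)

-- Replacing NULL with a defined value keeps each field in its own position.
SELECT CONCAT_WS('|', '1', COALESCE(NULL, ''), '2');  -- '1||2'
SELECT CONCAT_WS('|', '1', '2', COALESCE(NULL, ''));  -- '1|2|'
```

Because CONCAT_WS skips NULL arguments, two rows whose NULLs sit in different columns can produce the same string and therefore the same hash, so wrapping each field in COALESCE is probably still worthwhile even with CONCAT_WS.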
