What is a good way to remove duplicates?

I have a varchar column. It contains values separated by semicolons (;).

For example, it looks like this:

10;20;21;17;20;21;22;

It doesn't always have 7 elements. It could contain anywhere from around 30 to 70. The reason they designed it this way is that the values are actually genome segments, and it makes sense to enter or retrieve them collectively.

I need to remove records with duplicate values in this column, so if I see another record with the same value as above, I need to remove it.

I also need to remove a record if its values are contained in another record. For example, I need to remove

10;;21;17;20;21;22;

because it's the same as the first one except that it's missing the second value, 20. If it were more complete than the first, I would remove the first one instead.

1;2;3;4;5;6;7; and 1;2;3;4;5;6;7;8; are duplicates, and I'm keeping the second one because it's more complete. 1;2;3;4;5;6;;7 is also a duplicate. In this case, if they have 13 or more matched numbers and no mismatch, we will merge them so they become a single value, 1;2;3;4;5;6;7;7;.
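To make the matching and merging rules concrete, here is a minimal Python sketch. It rests on assumptions the question does not spell out: entries are compared position by position, an empty slot (;;) or a missing tail counts as "unknown" rather than as a mismatch, and a trailing semicolon is only a terminator. The names `parse_segments` and `merge_segments` are hypothetical, and the threshold of 13 matched numbers from the question is exposed as the `min_matches` parameter.

```python
from itertools import zip_longest


def parse_segments(value):
    """Split a semicolon-separated string into entries ('' marks a gap)."""
    # Assumption: a trailing ';' is a terminator, not an extra empty entry.
    if value.endswith(";"):
        value = value[:-1]
    return value.split(";")


def merge_segments(a, b, min_matches=13):
    """Return the merged string, or None if a and b are not duplicates.

    Two strings count as duplicates when no position holds two different
    non-empty entries and at least min_matches positions agree.
    """
    matches = 0
    merged = []
    # Pad the shorter string with gaps so a missing tail counts as unknown.
    for x, y in zip_longest(parse_segments(a), parse_segments(b), fillvalue=""):
        if x and y:
            if x != y:
                return None  # a real mismatch: not duplicates
            matches += 1
        merged.append(x or y)  # take whichever side has a value
    if matches < min_matches:
        return None
    return ";".join(merged) + ";"


# The example strings are shorter than 13 entries, so the threshold is lowered.
print(merge_segments("1;2;3;4;5;6;7;", "1;2;3;4;5;6;;7", min_matches=6))
```

With the lowered threshold, the call above reproduces the merged value given in the question, 1;2;3;4;5;6;7;7;. Comparing every pair of rows this way is quadratic, so on millions of records it would need some form of grouping first (for example, by the leading entries).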

I can scan each record in Java, but I'm afraid that will be complicated and time-consuming, given that the table contains millions of records. I was wondering if it's doable in Oracle itself.

My final goal is to calculate the frequency with which those numbers occur. For instance, if the number 10 appears 5 times out of 100, its frequency is 5%. The calculation will be simple. However, I can't calculate this unless I first make sure there are no duplicates in the table.
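As a rough illustration of that final step, the sketch below computes each number's share once the duplicates are gone. It assumes "5 out of 100" means 5 of all non-empty entries across all rows, which is one possible reading of the sentence above. The function name `segment_frequencies` is hypothetical.

```python
from collections import Counter


def segment_frequencies(values):
    """Map each number to its percentage of all non-empty entries."""
    # Empty strings come from gaps (;;) and the trailing ';' and are skipped.
    counts = Counter(
        segment
        for value in values
        for segment in value.split(";")
        if segment
    )
    total = sum(counts.values())
    return {segment: 100.0 * n / total for segment, n in counts.items()}


rows = ["10;20;", "10;;30;"]
print(segment_frequencies(rows))
```

If the intended denominator is the number of records rather than the number of entries, only the `total` line would change.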

Note: This answer is a placeholder because the question looks in danger of closure, but I think it will be worthy of an answer once all the rules are established.


It's trivial to remove the exact duplicates:

delete from your_table y
where y.rowid not in ( select min(x.rowid)
                       from your_table x
                       group by x.genome_string)
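
The statement can be tried out on a small scale before running it against the real table. The sketch below is only a stand-in: it uses Python's built-in `sqlite3` module instead of Oracle, relies on SQLite's implicit `rowid` in place of Oracle's `ROWID`, and drops the table aliases. The table and column names are the ones used above, and `delete_exact_duplicates` is a hypothetical helper.

```python
import sqlite3


def delete_exact_duplicates(conn):
    """Keep the lowest-rowid row per genome_string; return rows deleted."""
    cur = conn.execute(
        """
        DELETE FROM your_table
        WHERE rowid NOT IN (SELECT MIN(rowid)
                            FROM your_table
                            GROUP BY genome_string)
        """
    )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (genome_string TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?)",
    [("10;20;21;",), ("10;20;21;",), ("1;2;3;",)],
)
deleted = delete_exact_duplicates(conn)
remaining = [
    row[0]
    for row in conn.execute("SELECT genome_string FROM your_table ORDER BY rowid")
]
print(deleted, remaining)
```

Only byte-for-byte identical strings are collapsed here; strings that differ by a gap or a longer tail are left alone.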

The hard part is identifying duplicate strings that match exactly except for nulls. Merging rows makes the logic even more convoluted.

The SQL below is a solution ONLY IF:

  • 1;2;3;4;5; is a more complete form of 1;2;;5
  • All your entries end with ;

The query was tested using SQLite, so it may need some changes for Oracle.

It expects a table "TEST" with a column "VALUE".

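-- Keep each distinct VALUE unless a more complete version of it exists.
SELECT DISTINCT VALUE
FROM TEST ORIGIN_TEST  -- no AS: Oracle rejects AS before a table alias
WHERE NOT EXISTS (
    SELECT VALUE
    FROM TEST
    WHERE VALUE <> ORIGIN_TEST.VALUE
      AND (
          -- same string with an empty slot (;;) filled in
          VALUE LIKE replace(ORIGIN_TEST.VALUE, ';;', ';_%;')
          -- same string followed by additional entries
          OR VALUE LIKE ORIGIN_TEST.VALUE || '_%;'
      )
)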
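
To see what the query keeps and what it drops, it can be run against an in-memory SQLite database from Python. This is only a sketch: the sample rows are taken from the question, the table alias is written without AS so the same text is also acceptable to Oracle, and `most_complete_values` is a hypothetical helper name.

```python
import sqlite3

QUERY = """
SELECT DISTINCT VALUE
FROM TEST ORIGIN_TEST
WHERE NOT EXISTS (
    SELECT VALUE
    FROM TEST
    WHERE VALUE <> ORIGIN_TEST.VALUE
      AND (VALUE LIKE replace(ORIGIN_TEST.VALUE, ';;', ';_%;')
           OR VALUE LIKE ORIGIN_TEST.VALUE || '_%;')
)
"""


def most_complete_values(rows):
    """Load rows into an in-memory TEST table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TEST (VALUE TEXT)")
    conn.executemany("INSERT INTO TEST VALUES (?)", [(r,) for r in rows])
    return {row[0] for row in conn.execute(QUERY)}


kept = most_complete_values([
    "10;20;21;17;20;21;22;",
    "10;20;21;17;20;21;22;",  # exact duplicate
    "10;;21;17;20;21;22;",    # same, but missing the second value
    "1;2;3;4;5;6;7;",
    "1;2;3;4;5;6;7;8;",       # more complete form of the previous row
])
print(kept)
```

One limitation worth keeping in mind: the `_%` wildcard also matches across semicolons, so a gap can be "filled" by more than one entry (10;;21; would be treated as a less complete form of 10;5;6;21;). The query also only selects the surviving values; it neither deletes nor merges rows.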
