
Remove duplicates from String[]

I've created a web scraper to pull links, and every now and then I get duplicate entries. I'm storing the links in a String[] array. I've thrown a couple of loops together to check for duplicates, but I'm failing to remove them.

public static void checkForDupes(String[] links) {

    for (int i = 0; i < links.length; i++) {
        String check = links[i];

        for (String link : links) {

            if (check.equals(link)) {
                // remove link
            }
        }
    }

}

Simple arrays hold a fixed number of object references at fixed positions. Therefore you cannot remove elements; you can only replace them with other objects or with null, which would leave holes in the middle of the array. That is probably not what you want.

Since you cannot remove elements from an array, you need to replace the whole array with a new one that contains only the wanted entries. That raises a new problem: you have to specify the size of the target array up front, but you do not know which size you will finally need. So you either oversize the array and store the actually used length in another variable, or you use a LinkedList, which supports a variable size and performs well when adding elements.
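The list-based option could be sketched roughly as follows. The class name ListDedupSketch and the helper removeDuplicates are made up for illustration, and the sketch assumes the array contains no null entries. It still searches linearly, so it only addresses the sizing problem, not speed.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

class ListDedupSketch {

    // Hypothetical helper: returns a new array that keeps only the
    // first occurrence of each string, in the original order.
    static String[] removeDuplicates(String[] links) {
        List<String> unique = new LinkedList<>();
        for (String link : links) {
            // contains() is a linear search, so this is still O(n^2) overall
            if (!unique.contains(link)) {
                unique.add(link);
            }
        }
        // The list knows its final size, so the array gets exactly that length
        return unique.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] links = {"a", "b", "a", "c", "b"};
        System.out.println(Arrays.toString(removeDuplicates(links))); // prints [a, b, c]
    }
}
```

An ArrayList would work the same way here; the point is only that the collection grows as needed and is converted to an array at the end.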

In addition, such nested loops tend to become very slow when the array is large. Once you have more than 20 entries or so, collecting the values in a HashSet is much quicker than such simple for loops, and it eliminates duplicates as a side effect.

A closely related class is HashMap, which stores key-value pairs: its keys are unique, but it does not eliminate duplicate values. You should search for an explanation of how hashing algorithms and hash maps work. That is a very interesting topic.
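As a small illustration of how a map differs from a set, the following sketch counts how often each link occurs. The class name LinkCountSketch and the helper countLinks are hypothetical; Map.merge is a standard method available since Java 8.

```java
import java.util.HashMap;
import java.util.Map;

class LinkCountSketch {

    // Hypothetical helper: maps each distinct link to its number of occurrences.
    static Map<String, Integer> countLinks(String[] links) {
        Map<String, Integer> counts = new HashMap<>();
        for (String link : links) {
            // Each key is stored once; merge() adds 1 to the existing count
            counts.merge(link, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] links = {"a", "b", "a", "c", "b", "c", "d", "e", "f"};
        Map<String, Integer> counts = countLinks(links);
        // The keys are the distinct links; different keys may share the same value
        System.out.println(counts.keySet().size() + " distinct links");
        System.out.println("a occurs " + counts.get("a") + " times");
    }
}
```

Every entry with a count above 1 is a duplicated link, which can be handy if you want to log the duplicates rather than silently drop them.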

Example:

import java.util.Arrays;
import java.util.HashSet;

public class Main
{
    public static void main(String[] args) throws Exception
    {

        String[] links = {"a","b","a","c","b","c","d","e","f"};

        HashSet<String> set=new HashSet<>();
        set.addAll(Arrays.asList(links));

        System.out.println(set);
    }
}

Outputs:

[a, b, c, d, e, f]

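Note that a HashSet does not guarantee any particular iteration order and has no get(index) method. To read individual elements, iterate over the set with a for-each loop, or copy it into an array or list first, for example with set.toArray(new String[0]).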
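If you need the result as a String[] again, with the entries in their original order, one possible sketch uses a LinkedHashSet, which iterates in insertion order. The class name SetToArraySketch and the helper removeDuplicates are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class SetToArraySketch {

    // Hypothetical helper: removes duplicates and keeps the first occurrence
    // of each link in its original position.
    static String[] removeDuplicates(String[] links) {
        // LinkedHashSet drops duplicates like HashSet but remembers insertion order
        Set<String> unique = new LinkedHashSet<>(Arrays.asList(links));
        // Copy the set into an array so elements can be read by index
        return unique.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] links = {"a", "b", "a", "c", "b", "c", "d", "e", "f"};
        String[] unique = removeDuplicates(links);
        System.out.println(Arrays.toString(unique)); // prints [a, b, c, d, e, f]
        System.out.println(unique[2]);               // prints c
    }
}
```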

The following example shows how to achieve the same without HashSet:

import java.util.Arrays;

public class Main
{
    public static void main(String[] args) throws Exception
    {

        String[] input = {"a", "b", "a", "c", "b", "c", "d", "e", "f"};

        String[] output = new String[input.length];
        int count = 0;

        // Iterate over the input array
        for (String in : input)
        {
            // Check if the string is already in the output array
            boolean found=false;
            for (String out : output)
            {
                if (in.equals(out))
                {
                    found=true;
                    break; // break the inner for loop, no need to continue the search
                }
            }

            if (!found)
            {
                output[count++]=in;
            }
        }

        System.out.println(Arrays.toString(output));
    }
}

Outputs:

[a, b, c, d, e, f, null, null, null]

Note how I simplified the for loops by using for-each loops. Also notice that the output array contains some unused slots at the end, which are null. The count variable holds the number of slots that are actually used.
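If the trailing null slots get in the way, the oversized array can be cut down to its used length with Arrays.copyOf. The following is a sketch of that idea, not the only way to do it: the class name TrimSketch and the helper removeDuplicates are hypothetical, and the input is assumed to contain no null entries.

```java
import java.util.Arrays;

class TrimSketch {

    // Hypothetical helper: same nested-loop approach as above, but the
    // result is trimmed to the number of slots actually used.
    static String[] removeDuplicates(String[] input) {
        String[] output = new String[input.length]; // oversized on purpose
        int count = 0;

        for (String in : input) {
            boolean found = false;
            // Only the first 'count' slots hold real values
            for (int i = 0; i < count; i++) {
                if (in.equals(output[i])) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                output[count++] = in;
            }
        }

        // Copy the used part into a new array of exactly 'count' elements
        return Arrays.copyOf(output, count);
    }

    public static void main(String[] args) {
        String[] input = {"a", "b", "a", "c", "b", "c", "d", "e", "f"};
        System.out.println(Arrays.toString(removeDuplicates(input))); // prints [a, b, c, d, e, f]
    }
}
```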
