java - removing substring in a list of strings

Consider a list of strings, for example: list = ['apple', 'bat', 'cow', 'dog', 'applebat', 'cowbat', 'dogbark', 'help']

The Java code must check whether any element of the list is a substring of another element; if it is, the larger string must be removed.

So in this case the strings 'applebat', 'cowbat', and 'dogbark' are removed.

The approach I have taken is to make two copies of the list and iterate over them in the following way:

ArrayList<String> list1 = new ArrayList<String>(strings);
ArrayList<String> list2 = new ArrayList<String>(strings);
for (int i = 0; i < list1.size(); i++)
{
    String curr1 = list1.get(i);

    for (int j = 0; j < list2.size(); j++)
    {
        String curr2 = list2.get(j);

        // Drop curr2 if it contains curr1 and is not the same string.
        if (curr2.contains(curr1) && !curr2.equals(curr1))
        {
            list2.remove(j);
            j--; // stay on the same index after the removal
        }
    }
}

IMPORTANT: I have lists of 200K to 400K elements, and I would like to find a way to improve performance. I even tried hash sets, but they were not much help. The problem is the time the program takes.

Can anyone suggest improvements to my code, or other approaches in Java, to improve performance?

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HelloWorld
{
  public static void main(String[] args)
  {
    String[] strings = {"apple","bat","cow","dog","applebat","cowbat","dogbark","help"};
    List<String> list = Arrays.asList(strings);
    List<String> result = new ArrayList<String>(list);
    for (int i = 0; i < list.size(); i++)
    {
      String curr1 = list.get(i);
      // Compare against every element, not only the later ones: a longer
      // string may appear before the shorter string it contains.
      for (int j = 0; j < list.size(); j++)
      {
        String curr2 = list.get(j);
        if (curr2.contains(curr1) && !curr2.equals(curr1))
        {
          result.remove(curr2);
        }
      }
    }
    System.out.println(result);
  }
}

I suppose a Set will be faster here. You can do that easily with Java 8 lambdas and forEach.

Try this:

private Set<String> delete() {
    Set<String> startSet = new HashSet<>(Arrays.asList("a", "b", "c", "d", "ab", "bc", "ce", "fg"));
    Set<String> helperSet = new HashSet<>(startSet);

    helperSet.forEach(s1 -> helperSet.forEach(s2 -> {
        if (s2.contains(s1) && !s1.equals(s2)) {
            startSet.remove(s2);
        }
    }));

    return startSet;
}

Do not remove elements from the set you are iterating over, or you will get a ConcurrentModificationException.
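To illustrate that warning, here is a minimal sketch. The class and method names (SafeRemovalDemo, removingWhileIteratingThrows, removeSuperstrings) are made up for this example. The first method removes from a HashSet inside a for-each loop over the same set, which trips the iterator's fail-fast check. The second uses removeIf with a predicate that only reads a separate copy, which is one way to avoid the problem.

```java
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; the names are illustrative only.
final class SafeRemovalDemo {
    // Removes through the set itself while a for-each loop over it is running.
    static boolean removingWhileIteratingThrows() {
        Set<String> words = new HashSet<>(Arrays.asList("a", "b", "ab"));
        try {
            for (String word : words) {
                words.remove(word); // structural change behind the iterator's back
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Safe alternative: removeIf removes via the iterator, and the predicate
    // only reads a separate snapshot of the original contents.
    static Set<String> removeSuperstrings(Set<String> words) {
        Set<String> snapshot = new HashSet<>(words);
        Set<String> result = new HashSet<>(words);
        result.removeIf(candidate -> snapshot.stream()
                .anyMatch(other -> !other.equals(candidate) && candidate.contains(other)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(removingWhileIteratingThrows());
        System.out.println(removeSuperstrings(
                new HashSet<>(Arrays.asList("a", "b", "c", "d", "ab", "bc", "ce", "fg"))));
    }
}
```

This is still a quadratic comparison of every pair, so it shows the safe removal pattern rather than a speed-up for 200K to 400K elements.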

For a full performance boost on a huge list of words, I would think a combination of sorting and a string searching algorithm, such as the Aho–Corasick algorithm, is what you need, assuming you're willing to implement such complex logic.

First, sort the words by length.

Then build up the Aho–Corasick dictionary in word-length order. For each word, first check whether a substring of it exists in the dictionary. If it does, skip the word; otherwise add the word to the dictionary.

When done, dump the dictionary, or the list maintained in parallel if the dictionary is not easy or possible to dump.
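The steps above can be sketched as follows. This is a simplified variant, not the incremental dictionary described here. It builds one static Aho–Corasick automaton over all the words, then drops every word in which a strictly shorter word occurs. The result should be the same, because a word that contains a removed word also contains whatever shorter word caused that removal, and with a static automaton the length sort is no longer needed. The class and method names (SubstringFilter, removeSuperstrings, containsShorterWord) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not from the answer: a static Aho-Corasick automaton.
final class SubstringFilter {
    // Trie nodes are indices into these parallel lists; node 0 is the root.
    private final List<Map<Character, Integer>> children = new ArrayList<>();
    private final List<Integer> fail = new ArrayList<>();
    // Length of the shortest word ending at this node, following failure links too.
    private final List<Integer> shortest = new ArrayList<>();

    private SubstringFilter(List<String> words) {
        newNode(); // root
        for (String word : words) {
            int node = 0;
            for (char c : word.toCharArray()) {
                Integer child = children.get(node).get(c);
                if (child == null) {
                    child = newNode();
                    children.get(node).put(c, child);
                }
                node = child;
            }
            shortest.set(node, word.length());
        }
        // Breadth-first pass: a node's failure target is shallower, so it is already final.
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(0);
        while (!queue.isEmpty()) {
            int node = queue.poll();
            for (Map.Entry<Character, Integer> e : children.get(node).entrySet()) {
                int child = e.getValue();
                int target = (node == 0) ? 0 : step(fail.get(node), e.getKey());
                fail.set(child, target);
                shortest.set(child, Math.min(shortest.get(child), shortest.get(target)));
                queue.add(child);
            }
        }
    }

    private int newNode() {
        children.add(new HashMap<>());
        fail.add(0);
        shortest.add(Integer.MAX_VALUE);
        return children.size() - 1;
    }

    // Follow failure links until some node has an edge for c, then take it.
    private int step(int node, char c) {
        while (node != 0 && !children.get(node).containsKey(c)) {
            node = fail.get(node);
        }
        return children.get(node).getOrDefault(c, 0);
    }

    // True if some word strictly shorter than text occurs inside text.
    private boolean containsShorterWord(String text) {
        int node = 0;
        if (shortest.get(node) < text.length()) return true; // empty word in the list
        for (char c : text.toCharArray()) {
            node = step(node, c);
            if (shortest.get(node) < text.length()) return true;
        }
        return false;
    }

    // Keeps only the words that contain no other, shorter word from the list.
    static List<String> removeSuperstrings(List<String> words) {
        SubstringFilter automaton = new SubstringFilter(words);
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!automaton.containsShorterWord(word)) kept.add(word);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "apple", "bat", "cow", "dog", "applebat", "cowbat", "dogbark", "help");
        System.out.println(removeSuperstrings(words)); // [apple, bat, cow, dog, help]
    }
}
```

Like the question's code, this keeps duplicates of a surviving word, since a match of the same length as the word can only be the word itself. Building and scanning are roughly proportional to the total number of characters, which is the point of using the automaton instead of comparing every pair.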

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Random;

public class SubStrRmove {
    public static List<String> randomList(int size) {
        final String BASE = "abcdefghijklmnopqrstuvwxyz";
        Random random = new Random();
        List<String> list = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            int length = random.nextInt(3) + 2;
            // StringBuilder is enough here: the buffer is local to this loop iteration.
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < length; j++) {
                int number = random.nextInt(BASE.length());
                sb.append(BASE.charAt(number));
            }
            list.add(sb.toString());
        }
        return list;
    }

    public static List<String> removeListSubStr(List<String> args) {
        String[] input = args.toArray(new String[args.size()]);
        Arrays.parallelSort(input, (s1, s2) -> s1.length() - s2.length());
        List<String> result = new ArrayList<>(args.size());
        for (int i = 0; i < input.length; i++) {
            String temp = input[i];
            if (result.stream().noneMatch(s -> temp.contains(s))) {
                result.add(input[i]);
            }
        }
        return result;
    }

    public static List<String> removeListSubStr2(List<String> args) {
        String[] input = args.toArray(new String[args.size()]);
        Arrays.parallelSort(input, (s1, s2) -> s1.length() - s2.length());
        List<String> result = new ArrayList<>(args.size());
        for (int i = 0; i < input.length; i++) {
            boolean isDiff = true;
            for (int j = 0; j < result.size(); j++) {
                if (input[i].indexOf(result.get(j)) >= 0) {
                    isDiff = false;
                    break;
                }
            }
            if (isDiff) {
                result.add(input[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> list = randomList(20000);
        long start1 = new Date().getTime();
        List<String> listLambda = removeListSubStr(list);
        long end1 = new Date().getTime();
        long start2 = new Date().getTime();
        List<String> listFor = removeListSubStr2(list);
        long end2 = new Date().getTime();
        System.out.println("method lambda: " + (end1 - start1) + "ms");
        System.out.println("method simple: " + (end2 - start2) + "ms");
        System.out.println(listLambda.size() + " " + listLambda);
        System.out.println(listFor.size() + " " + listFor);

    }

}

I have tested it on small data, and I hope it helps you find a solution.

import java.util.ArrayList;
import java.util.Arrays;

public class Main {
    public static void main(String[] args){
        String []list = {"apple","bat","cow","dog","applebat","cowbat","dogbark","help","helpless","cows"};
        System.out.println(Arrays.toString(list));
        int preLength = 0;
        int proLength = 0;
        long pretime = System.nanoTime();
        for(int i=0;i<list.length;i++){
            String x = list[i];
            // "0" marks a removed entry; this assumes no real entry contains "0".
            if(x.equals("0")){
                continue;
            }
            preLength = x.length();
            for(int j=i+1;j<list.length;j++){
                String y = list[j];
                if(y.equals(x)){
                    list[j] = "0"; // duplicate of x
                }else if(y.contains(x)||x.contains(y)){
                    proLength = y.length();
                    if(preLength<proLength){
                        list[j] = "0"; // y is the larger string
                    }
                    if(preLength>proLength){
                        list[i] = "0"; // x is the larger string
                        break;
                    }
                }
            }
        }
        long protime = System.nanoTime();
        long time = (protime - pretime);
        System.out.println("time elapsed : " + time + "ns");
        UpdateArray(list);
    }

    public static void UpdateArray(String[] list){
        ArrayList<String> arrayList = new ArrayList<>();
        for(int i=0;i<list.length;i++){
            if(!list[i].equals("0")){
                arrayList.add(list[i]);
            }
        }
        System.out.println(arrayList.toString());
    }
}

Output:

[apple, bat, cow, dog, applebat, cowbat, dogbark, help, helpless, cows]
time elapsed : 47393ns
[apple, bat, cow, dog, help]
