
Need to remove duplicates from a text file by comparing the 1st and 5th string of every line

As part of a project I'm working on, I'd like to clean up a file I generate that contains duplicate line entries. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java: it is supposed to find duplicates in the file by storing the 1st and 5th strings of each line in two ArrayLists and iterating over them, but it is not working because the nested for loops make the condition match in many unintended ways (see the sketch after the code below).

I need an integrated solution for this, preferably in Java. Any ideas?

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.StringTokenizer;

    public class duplicates {
        static BufferedReader reader = null;
        static BufferedWriter writer = null;
        static String currentLine;

        public static void main(String[] args) throws IOException {
            int count = 0, linecount = 0;
            String fe = null, fie = null, pe = null;
            File file = new File("E:\\Book.txt");

            ArrayList<String> list1 = new ArrayList<String>();   // 1st token of every line read so far
            ArrayList<String> list2 = new ArrayList<String>();   // 5th token of every line read so far

            reader = new BufferedReader(new FileReader(file));

            while ((currentLine = reader.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(currentLine, "/");  // splits the line into tokens
                while (st.hasMoreElements()) {
                    count++;
                    fe = (String) st.nextElement();

                    if (count == 1) {            // stores the 1st token
                        pe = fe;
                    } else if (count == 5) {     // stores the 5th token
                        fie = fe;
                    }
                }
                count = 0;

                if (linecount > 0) {
                    // Problem: every stored 1st token is paired with every stored 5th token,
                    // so the condition also matches tokens that came from two different lines.
                    for (String s1 : list1) {
                        for (String s2 : list2) {
                            if (pe.equals(s1) && fie.equals(s2)) {   // checking condition
                                System.out.println("duplicate found");
                            }
                        }
                    }
                }
                list1.add(pe);
                list2.add(fie);
                linecount++;
            }
            reader.close();
        }
    }
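To illustrate the problem in the nested loops above: every stored 1st token is compared against every stored 5th token, so the condition also fires for tokens taken from two different lines. A minimal sketch of a check that walks both lists in step instead (a hypothetical helper, assuming it is dropped into the class above next to main):

    // Sketch: returns true only if some earlier line had BOTH the same 1st token
    // and the same 5th token at the same list index.
    static boolean isDuplicate(ArrayList<String> firsts, ArrayList<String> fifths,
                               String first, String fifth) {
        for (int i = 0; i < firsts.size(); i++) {
            if (first.equals(firsts.get(i)) && fifth.equals(fifths.get(i))) {
                return true;
            }
        }
        return false;
    }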

Input:

/book1/_cwc/B737/customer/Special_Reports/
/Airbook/_cwc/A330-200/customer/02_Watchlists/
/book1/_cwc/B737/customer/Special_Reports/
/jangeer/_cwc/Crj_200/customer/plots/
/Airbook/_cwc/A330-200/customer/02_Watchlists/
/jangeer/_cwc/Crj_200/customer/06_Performance_Summaries/
/jangeer/_cwc/Crj_200/customer/02_Watchlists/
/jangeer/_cwc/Crj_200/customer/01_Highlights/
/jangeer/_cwc/ERJ170/customer/01_Highlights/

Output:

/book1/_cwc/B737/customer/Special_Reports/
/Airbook/_cwc/A330-200/customer/02_Watchlists/
/jangeer/_cwc/Crj_200/customer/plots/
/jangeer/_cwc/Crj_200/customer/06_Performance_Summaries/
/jangeer/_cwc/Crj_200/customer/02_Watchlists/
/jangeer/_cwc/Crj_200/customer/01_Highlights/

    public static void removeDups() {
        String[] input = new String[] { // let's say you read the whole file into this string array
                "/book1/_cwc/B737/customer/Special_Reports/",
                "/Airbook/_cwc/A330-200/customer/02_Watchlists/",
                "/book1/_cwc/B737/customer/Special_Reports/",
                "/jangeer/_cwc/Crj_200/customer/plots/",
                "/Airbook/_cwc/A330-200/customer/02_Watchlists/",
                "/jangeer/_cwc/Crj_200/customer/06_Performance_Summaries/",
                "/jangeer/_cwc/Crj_200/customer/02_Watchlists/",
                "/jangeer/_cwc/Crj_200/customer/01_Highlights/",
                "/jangeer/_cwc/ERJ170/customer/01_Highlights/"
        };
        ArrayList<String> outPut = new ArrayList<>(); // list for storing the output, i.e. the distinct lines
        Arrays.stream(input).distinct().forEach(x -> outPut.add(x)); // with Java 8 streams, distinct() filters out the duplicates
        outPut.forEach(System.out::println); // just printing here as an example; you can write the output back to the file with your own implementation
    }
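The same stream idea can also read and write the file directly instead of using a hard-coded array. A minimal sketch, assuming the E:\Book.txt path from the question (the output file name is made up):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class DistinctLines {
        public static void main(String[] args) throws IOException {
            Path in = Paths.get("E:\\Book.txt");
            Path out = Paths.get("E:\\Book_distinct.txt"); // hypothetical output file

            try (Stream<String> lines = Files.lines(in)) {
                List<String> distinct = lines.distinct()   // drops exact duplicate lines
                                             .collect(Collectors.toList());
                Files.write(out, distinct);                // writes one entry per line
            }
        }
    }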

The output when I ran this method was:

/book1/_cwc/B737/customer/Special_Reports/
/Airbook/_cwc/A330-200/customer/02_Watchlists/
/jangeer/_cwc/Crj_200/customer/plots/
/jangeer/_cwc/Crj_200/customer/06_Performance_Summaries/
/jangeer/_cwc/Crj_200/customer/02_Watchlists/
/jangeer/_cwc/Crj_200/customer/01_Highlights/
/jangeer/_cwc/ERJ170/customer/01_Highlights/

EDIT

Non-Java 8 answer

    public static void removeDups() {
        String[] input = new String[] {
                "/book1/_cwc/B737/customer/Special_Reports/",
                "/Airbook/_cwc/A330-200/customer/02_Watchlists/",
                "/book1/_cwc/B737/customer/Special_Reports/",
                "/jangeer/_cwc/Crj_200/customer/plots/",
                "/Airbook/_cwc/A330-200/customer/02_Watchlists/",
                "/jangeer/_cwc/Crj_200/customer/06_Performance_Summaries/",
                "/jangeer/_cwc/Crj_200/customer/02_Watchlists/",
                "/jangeer/_cwc/Crj_200/customer/01_Highlights/",
                "/jangeer/_cwc/ERJ170/customer/01_Highlights/"
        };

        LinkedHashSet<String> output = new LinkedHashSet<String>(Arrays.asList(input)); //output is your set of unique strings in preserved order

    }
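Continuing from the output set above, writing the de-duplicated lines back to a file without streams could look like the sketch below (the output path is only an example; it needs the java.io.BufferedWriter and java.io.FileWriter imports):

    // Sketch: iterate the LinkedHashSet in insertion order and write each line out.
    BufferedWriter writer = new BufferedWriter(new FileWriter("E:\\Book_distinct.txt"));
    try {
        for (String line : output) {
            writer.write(line);
            writer.newLine();
        }
    } finally {
        writer.close();
    }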

Use a Set<String> instead of an ArrayList<String>.

Duplicates aren't allowed in a Set, so if you just add every line to it and then read them back out, you'll have all the distinct strings.

Performance-wise, it's also quicker than your nested for loops.
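If the de-duplication really has to be based on only the 1st and 5th strings of each line, as in the expected output above, the same Set idea can key on just those two tokens. A minimal sketch, assuming the E:\Book.txt path from the question (the output file name is made up):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    public class DedupByTokens {
        public static void main(String[] args) throws IOException {
            Set<String> seen = new HashSet<String>();   // (1st token | 5th token) keys already written
            BufferedReader reader = new BufferedReader(new FileReader("E:\\Book.txt"));
            BufferedWriter writer = new BufferedWriter(new FileWriter("E:\\Book_dedup.txt"));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("/");   // parts[0] is empty because of the leading "/"
                    if (parts.length < 6) {             // not enough tokens: keep the line unchanged
                        writer.write(line);
                        writer.newLine();
                        continue;
                    }
                    String key = parts[1] + "|" + parts[5];   // 1st and 5th string of the line
                    if (seen.add(key)) {                      // add() returns false for a key already seen
                        writer.write(line);
                        writer.newLine();
                    }
                }
            } finally {
                reader.close();
                writer.close();
            }
        }
    }

This keeps the first occurrence of each (1st, 5th) pair and preserves the original line order, which is what the expected output in the question shows.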
