
What is the fastest way to read/filter a text file?

I'm trying to loop through a text log file containing SSH logins and other log entries.

The program returns the total number of SSH logins.

My solution works but seems a bit slow (~3.5 s on a 200 Mo file). I would like to know if there is any way to make it faster. I'm not really familiar with good practices in Java.

I'm using the BufferedReader class. Maybe there are better classes/methods but everything else I found online was slower.

{
            BufferedReader br;
            if(fileLocation != null) {
                br = new BufferedReader(new FileReader(fileLocation));
            }
            else {
                br = new BufferedReader((new InputStreamReader(System.in, "UTF-8")));
            }
            String line;
            Stack<String> users = new Stack<>();
            int succeeded = 0;
            int failed;
            int total = 0;

            if(!br.ready()) {
                help("Cannot read the file", true);
            }
            while((line=br.readLine())!=null)
            {
                if(!line.contains("sshd")) continue;
                String[] arr = line.split("\\s+");
                if(arr.length < 11) continue;


                String log = arr[4];
                String log2 = arr[5];
                String log3 = arr[8];
                String user = arr[10];
                if(!log.contains("sshd")) continue;
                if(!log2.contains("Accepted")) {
                    if(log3.contains("failure")) {
                        total++;
                    }
                    continue;
                }
                total++;
                succeeded++;

                if(!repeat) {
                    if (users.contains(user)) continue;
                    users.add(user);
                }

                System.out.println((total + 1) + " " + user);
            }

Full code: https://pastebin.com/xp2P9wja

Also, here are some lines from the log file:

Dec  3 12:20:12 k332 sshd[25206]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=10.147.222.137 
Dec  3 12:20:14 k332 sshd[25204]: error: PAM: Authentication failure for illegal user admin from 10.147.222.137
Dec  3 12:20:14 k332 sshd[25204]: Failed keyboard-interactive/pam for invalid user admin from 10.147.222.137 port 36417 ssh2
Dec  3 12:20:14 k332 sshd[25204]: Connection closed by invalid user admin 10.147.222.137 port 36417 [preauth]
Dec  3 12:20:40 k332 sshd[25209]: pam_tally2(sshd:auth): Tally overflowed for user root

The final output is:

Total :
103 unique IP SSH logins succeeded
30387 SSH logins succeeded
17186 SSH logins failed
47573 total SSH logins

Thanks for your time!

EDIT: Mo (mégaoctet) = MB (megabyte). We usually say Mo in French.

Here's the full updated code if anyone needs it: https://pastebin.com/Kn5EqLNX

If you profile your code, it becomes clear that the bottleneck is the String.split() method:

[Image: profiler output]

This is a known issue in the Java standard library; see "Java split String performances".

So, to speed up your code, you need to speed up this part of it. The first thing I can suggest is to replace the code on lines 75-79 with this:

Pattern pattern = Pattern.compile("\\s+");
while ((line = br.readLine()) != null) {
    if (!line.contains("sshd")) continue;
    String[] arr = pattern.split(line);
    if (arr.length < 11) continue;
...
}

This may speed up the code a bit, but you can see from the profile that a lot of time is still spent in Pattern and Matcher methods. We need to get rid of Pattern and Matcher for a significant speedup.
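To check how much the precompiled pattern helps on your own machine, a rough timing sketch along these lines may be useful. The class name SplitTiming and its methods are hypothetical helpers, not part of the original program, and a plain System.nanoTime() loop is only a coarse measurement, so treat the numbers as indicative at best.

```java
import java.util.Arrays;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original program.
final class SplitTiming {
    // Compiled once and reused for every line.
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Baseline: String.split with a multi-character regex compiles a Pattern on each call.
    static String[] splitPerCall(String line) {
        return line.split("\\s+");
    }

    // Variant from the answer: reuse the precompiled Pattern.
    static String[] splitPrecompiled(String line) {
        return WHITESPACE.split(line);
    }

    // Rough wall-clock timing; token counts are summed so the result is actually used.
    static long timeNanos(Function<String, String[]> splitter, String line, int iterations) {
        long tokens = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            tokens += splitter.apply(line).length;
        }
        long elapsed = System.nanoTime() - start;
        if (tokens == 0) {
            throw new IllegalStateException("no tokens produced");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        String line = "Dec  3 12:20:14 k332 sshd[25204]: Failed keyboard-interactive/pam "
                + "for invalid user admin from 10.147.222.137 port 36417 ssh2";

        // Both variants should agree before their timings are worth comparing.
        System.out.println("same tokens: "
                + Arrays.equals(splitPerCall(line), splitPrecompiled(line)));

        int iterations = 1_000_000;
        // Warm-up runs so the JIT has a chance to compile both paths.
        timeNanos(SplitTiming::splitPerCall, line, iterations);
        timeNanos(SplitTiming::splitPrecompiled, line, iterations);

        System.out.println("per call:    "
                + timeNanos(SplitTiming::splitPerCall, line, iterations) / 1_000_000 + " ms");
        System.out.println("precompiled: "
                + timeNanos(SplitTiming::splitPrecompiled, line, iterations) / 1_000_000 + " ms");
    }
}
```

For anything beyond a quick sanity check, a proper benchmark harness would give more trustworthy numbers than this loop.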

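For a single-character separator that is not a regex metacharacter, split takes a fast path that does not use regex at all and is quite efficient. Splitting on a single space produces empty strings wherever the log contains consecutive spaces (as in "Dec  3"), so those have to be filtered out afterwards. Let's try replacing the code with: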

while ((line = br.readLine()) != null) {
    if (!line.contains("sshd")) continue;
    String[] arr = Arrays.stream(line.split(" "))
                    .filter(s -> !s.isEmpty())
                    .toArray(String[]::new);
    if (arr.length < 11) continue;
...
}

This code runs almost twice as fast on the same data.
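If the stream and the intermediate array still show up in a profile, one further option is to scan the line by hand. The following is a minimal sketch, assuming fields are separated only by spaces or tabs; the class WhitespaceSplitter and its method names are made up for illustration and do not come from the original code. Unlike split("\\s+"), it never returns a leading empty token when a line starts with whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original program.
final class WhitespaceSplitter {

    // Splits on runs of spaces/tabs without regex and without empty tokens.
    static String[] splitOnWhitespace(String line) {
        List<String> tokens = new ArrayList<>(16);
        int n = line.length();
        int i = 0;
        while (i < n) {
            // Skip the run of separators before the next token.
            while (i < n && isSeparator(line.charAt(i))) {
                i++;
            }
            if (i == n) {
                break;
            }
            // Consume the token itself.
            int start = i;
            while (i < n && !isSeparator(line.charAt(i))) {
                i++;
            }
            tokens.add(line.substring(start, i));
        }
        return tokens.toArray(new String[0]);
    }

    // Assumption: only space and tab separate fields in the log.
    private static boolean isSeparator(char c) {
        return c == ' ' || c == '\t';
    }

    public static void main(String[] args) {
        String line = "Dec  3 12:20:14 k332 sshd[25204]: Failed keyboard-interactive/pam "
                + "for invalid user admin from 10.147.222.137 port 36417 ssh2";
        String[] arr = splitOnWhitespace(line);
        System.out.println(arr.length + " tokens, field 10 = " + arr[10]);
    }
}
```

In the reading loop this would replace the split call with `String[] arr = WhitespaceSplitter.splitOnWhitespace(line);`. Whether it actually beats the split(" ") version on your data is something to measure rather than assume.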
