Using one text file to search through another text file in Java

I'm trying to search through a file (File B) for matching strings from another file (File A). If the string is found in File A, print the entire line(s) from File B and update the progress on the corresponding JProgressBar(s) as the lines are being read.

The code below works as expected, but performance is the problem. With large files, it takes about 15 minutes to scan just 5 thousand lines.

I'm looking for a way to process large files, for example 500K lines.

Please suggest whether this can be enhanced to handle large files, or point out which part of my code is causing the slowness.

import java.awt.BorderLayout;
import java.awt.EventQueue;
import java.awt.TextField;

import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.border.EmptyBorder;
import javax.swing.JFileChooser;
import javax.swing.JProgressBar;
import javax.swing.JTextArea;
import javax.swing.JButton;

import java.awt.Font;

import javax.swing.JTextField;
import javax.swing.JLabel;
import javax.swing.JScrollPane;

import java.awt.event.ActionListener;
import java.awt.event.ActionEvent;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.time.LocalDateTime;


public class Test_MultiJProgressBars_MultiFileReads extends JFrame {

 private JPanel contentPane;
 private JTextField textField_File1;
 private JTextField textField_File2;
 private JProgressBar progressBar_F1;
 private JProgressBar progressBar_F2;
 private JTextArea textArea_File1;

 /**
  * Launch the application.
  */
 public static void main(String[] args) {
         EventQueue.invokeLater(new Runnable() {
                 public void run() {
                 try {
                      Test_MultiJProgressBars_MultiFileReads frame = new Test_MultiJProgressBars_MultiFileReads();
                      frame.setVisible(true);
                      } catch (Exception e) {
                      e.printStackTrace();
                      }
                 }
         });
 }

 /**
  * Create the frame.
  */


 public void FileLineCount (JTextField TexFieldName, JProgressBar ProgressBarName) throws IOException {
         File FileX = new File (TexFieldName.getText());
         FileReader Fr = new FileReader(FileX);
         LineNumberReader Lnr = new LineNumberReader(Fr);

         int lineNumber =0 ;
         while (Lnr.readLine() !=null) {
                 lineNumber++;
         }
         // Setting Maximum Value on ProgressBar
         ProgressBarName.setMaximum(lineNumber);
         System.out.println("Total line in file : "+lineNumber);
         Lnr.close();
 }


 public void ScanFileForMatches() {
         File My_Refernce_File = new File (textField_File1.getText());
         File My_Source_File = new File (textField_File2.getText());

         int F1_JP_v = 0;
         int F2_JP_v = 0;

         try {
                 BufferedReader F1_br = new  BufferedReader(new FileReader(My_Refernce_File));

                 String F1_br_Line;
                 String F2_br_Line = null;

                 while ((F1_br_Line = F1_br.readLine()) !=null) {
                         //System.out.println("File 1 : "+F1_br_Line+"\n");
                         F1_JP_v++;
                         progressBar_F1.setValue(F1_JP_v);


                          try {
                               BufferedReader F2_br = new BufferedReader(new FileReader(My_Source_File));
                               while ((F2_br_Line = F2_br.readLine()) !=null) {
                                F2_JP_v++;
                                progressBar_F2.setValue(F2_JP_v);

                                if (F1_br_Line.contains(F2_br_Line)) {
                                        System.out.println("MATCHED --- File 1:"+F1_br_Line+" File 2:"+F2_br_Line+"\n");
                                        textArea_File1.append(LocalDateTime.now()+" : SYSOUT : MATCHED --- File 1:= "+F1_br_Line+"\n");

                                } else {
                                        System.out.println("NOMATCH --- File 1:"+F1_br_Line+" File 2:"+F2_br_Line+"\n");

                                }
                                // Reset Progressbar after each Loop.
                                progressBar_F2.setValue(0);
                                 }
                                 // Set Progressbar to last value in the loop.
                                 progressBar_F2.setValue(F2_JP_v);
                                 F2_br.close();
                                 } catch (Exception e) {
                                         // TODO: handle exception
                             }
                 }
                 F1_br.close();
         } catch (Exception e) {
                 // TODO: handle exception
         }
 }


 public Test_MultiJProgressBars_MultiFileReads() {
         setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
         setBounds(100, 100, 799, 568);
         contentPane = new JPanel();
         contentPane.setBorder(new EmptyBorder(5, 5, 5, 5));
         setContentPane(contentPane);
         contentPane.setLayout(null);

         progressBar_F1 = new JProgressBar();
         progressBar_F1.setStringPainted(true);
         progressBar_F1.setBounds(10, 96, 763, 50);
         contentPane.add(progressBar_F1);

         progressBar_F2 = new JProgressBar();
         progressBar_F2.setStringPainted(true);
         progressBar_F2.setBounds(10, 169, 763, 50);
         contentPane.add(progressBar_F2);

         JScrollPane scrollPane = new JScrollPane();
         scrollPane.setBounds(10, 264, 763, 109);
         contentPane.add(scrollPane);

         textArea_File1 = new JTextArea();
         scrollPane.setViewportView(textArea_File1);

         JScrollPane scrollPane_1 = new JScrollPane();
         scrollPane_1.setBounds(10, 409, 763, 110);
         contentPane.add(scrollPane_1);

         JTextArea textArea_FIle2 = new JTextArea();
         scrollPane_1.setViewportView(textArea_FIle2);

         JButton btnStart = new JButton("SCAN");
         btnStart.addActionListener(new ActionListener() {
                 public void actionPerformed(ActionEvent arg0) {


                         // Call FileLineCount Method and setMaximum value on respective JPorgress Bars.
                         try {
                                 FileLineCount(textField_File1,progressBar_F1);
                                 FileLineCount(textField_File2,progressBar_F2);
                         } catch (IOException e) {
                                 // TODO Auto-generated catch block
                                 e.printStackTrace();
                         }
                         // Call ScanFileForMatches to Scan files and Update JProgress Bars.

                         Thread t1 = new Thread (new Runnable() {

                                 @Override
                                 public void run() {
                                         // TODO Auto-generated method stub
                                         ScanFileForMatches();
                                 }
                         });
                         t1.start();

                 }
         });
         btnStart.setFont(new Font("Tahoma", Font.BOLD, 11));
         btnStart.setBounds(684, 10, 89, 57);
         contentPane.add(btnStart);

         textField_File1 = new JTextField();
         textField_File1.setBounds(10, 10, 486, 23);
         contentPane.add(textField_File1);
         textField_File1.setColumns(10);

         textField_File2 = new JTextField();
         textField_File2.setBounds(10, 44, 486, 23);
         contentPane.add(textField_File2);
         textField_File2.setColumns(10);

         JButton btnFile_File1 = new JButton("File 1");
         btnFile_File1.addActionListener(new ActionListener() {
                 public void actionPerformed(ActionEvent arg0) {
                         JFileChooser JFC_File1 = new JFileChooser();
                         JFC_File1.showOpenDialog(null);
                         File JFC_File1_Name = JFC_File1.getSelectedFile();
                         textField_File1.setText(JFC_File1_Name.getAbsolutePath());
                 }
         });
         btnFile_File1.setBounds(506, 10, 89, 23);
         contentPane.add(btnFile_File1);


         JButton btnFile_File2 = new JButton("File 2");
         btnFile_File2.addActionListener(new ActionListener() {
                 public void actionPerformed(ActionEvent arg0) {
                         JFileChooser JFC_File2 = new JFileChooser();
                         JFC_File2.showOpenDialog(null);
                         File JFC_File2_Name = JFC_File2.getSelectedFile();
                         textField_File2.setText(JFC_File2_Name.getAbsolutePath());
                 }
         });
         btnFile_File2.setBounds(506, 44, 89, 23);
         contentPane.add(btnFile_File2);


         JLabel lblFile = new JLabel("File 1 Progress");
         lblFile.setBounds(20, 78, 137, 14);
         contentPane.add(lblFile);

         JLabel lblFile_1 = new JLabel("File 2 Progress");
         lblFile_1.setBounds(20, 150, 137, 14);
         contentPane.add(lblFile_1);

         JLabel lblFileLog = new JLabel("File 2 Log");
         lblFileLog.setBounds(20, 384, 147, 14);
         contentPane.add(lblFileLog);

         JLabel lblFileLog_1 = new JLabel("File 1 Log");
         lblFileLog_1.setBounds(20, 239, 147, 14);
         contentPane.add(lblFileLog_1);
 }
}

Your current solution iterates linearly through file1 and, for each line, iterates linearly through file2. This results in a running time of O(F1*F2): the time it takes to run scales quadratically with the number of lines (F1 and F2) in your files. In addition, file2 is read into memory again each time it is checked for a match, which is very expensive.

A better solution would be to read file2 into memory (e.g. an ArrayList) and sort it:

Collections.sort(file2);

Then file1 can be iterated as you currently do, and for each line you can use binary search to check whether that String exists in file2:

for (String s1 : file1) { int index = Collections.binarySearch(file2, s1); }

The index will be non-negative if s1 is in file2.

This solution takes linearithmic time instead of quadratic and thus scales much better on larger inputs.
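As a rough illustration of the sort-then-search idea, here is a minimal sketch. The class name SortedLineLookup and the method findMatches are made up for this example, and it assumes that a match means a whole line of file1 is equal to a whole line of file2. Note that the code in the question uses contains (substring matching), which binary search cannot replace directly, so check which of the two you actually need.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the question's code.
class SortedLineLookup {

    /** Returns the lines of file1 that also occur, as whole lines, in file2. */
    static List<String> findMatches(List<String> file1, List<String> file2) {
        List<String> sorted = new ArrayList<>(file2); // copy, so the caller's list is untouched
        Collections.sort(sorted);                     // sorted once, natural String order

        List<String> matches = new ArrayList<>();
        for (String s1 : file1) {
            // binarySearch returns a non-negative index only if an equal element exists
            if (Collections.binarySearch(sorted, s1) >= 0) {
                matches.add(s1);
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java SortedLineLookup <file1> <file2>");
            return;
        }
        // Each file is read from disk exactly once (UTF-8 is an assumption).
        List<String> file1 = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> file2 = Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8);
        for (String match : findMatches(file1, file2)) {
            System.out.println("MATCHED --- " + match);
        }
    }
}
```

Both files are held in memory here, which should be fine for around 500K lines of ordinary length but is worth reconsidering for much larger inputs.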

If you would like to improve the time it takes to sort, consider MSD Sort instead of Collections.sort. It is only a minor improvement, but hey, it counts.
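For reference, an MSD (most-significant-digit-first) string sort could look roughly like the sketch below. It follows the common textbook formulation; the class name MsdStringSort, the radix of 256 and the cutoff of 15 are assumptions chosen for this example. It only handles characters below 256, and whether it actually beats Collections.sort on your data is something to measure rather than assume.

```java
import java.util.Arrays;

// Hypothetical sketch of MSD radix sort for strings; not taken from the question.
class MsdStringSort {
    private static final int R = 256;     // radix: assumes every char is below 256
    private static final int CUTOFF = 15; // small ranges fall back to insertion sort

    static void sort(String[] a) {
        sort(a, new String[a.length], 0, a.length - 1, 0);
    }

    // Character at position d, or -1 when the string has no character there.
    private static int charAt(String s, int d) {
        if (d >= s.length()) return -1;
        char c = s.charAt(d);
        if (c >= R) throw new IllegalArgumentException("character outside assumed radix: " + c);
        return c;
    }

    // Sorts a[lo..hi], whose strings all share the same first d characters.
    private static void sort(String[] a, String[] aux, int lo, int hi, int d) {
        if (hi <= lo + CUTOFF) {
            insertionSort(a, lo, hi);
            return;
        }
        int[] count = new int[R + 2];
        for (int i = lo; i <= hi; i++) count[charAt(a[i], d) + 2]++;             // frequencies
        for (int r = 0; r < R + 1; r++) count[r + 1] += count[r];                // bucket offsets
        for (int i = lo; i <= hi; i++) aux[count[charAt(a[i], d) + 1]++] = a[i]; // distribute
        for (int i = lo; i <= hi; i++) a[i] = aux[i - lo];                       // copy back
        for (int r = 0; r < R; r++) {                                            // one bucket per char
            sort(a, aux, lo + count[r], lo + count[r + 1] - 1, d + 1);
        }
    }

    private static void insertionSort(String[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            for (int j = i; j > lo && a[j].compareTo(a[j - 1]) < 0; j--) {
                String tmp = a[j];
                a[j] = a[j - 1];
                a[j - 1] = tmp;
            }
        }
    }

    public static void main(String[] args) {
        String[] lines = {"she", "sells", "seashells", "by", "the", "sea", "shore"};
        sort(lines);
        System.out.println(Arrays.toString(lines));
        // The result is in natural String order, so binary search works on it.
        System.out.println(Arrays.binarySearch(lines, "sea") >= 0);
    }
}
```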

  1. You could try to sort the rows in file A, i.e. the file you are searching in. This way, you can perform a binary search in it.

  2. As a second step, I would create two threads:

    • 1 search thread, which will search in file A
    • 1 reader thread, which will read file B

The B-reader fetches a block of rows into memory (rather than a single row). It then starts an A-reader thread, which performs the binary search for that block while the B-reader keeps going through B to fetch the next block of rows.
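One possible way to wire up the two threads is sketched below. It is a simplified variant, not the only design: instead of starting a new search thread per block, the reader thread hands blocks over through a queue and the calling thread does the searching. The names BlockPipeline, search and END are invented for this example, and it assumes file A has already been loaded and sorted and that a match means whole-line equality.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a reader thread feeding blocks of rows to a searcher.
class BlockPipeline {
    // Sentinel block that marks the end of file B; compared by identity.
    private static final List<String> END = new ArrayList<>();

    /** Returns the rows of B that occur as whole lines in sortedA. */
    static List<String> search(List<String> sortedA, BufferedReader fileB, int blockSize)
            throws InterruptedException {
        // Unbounded for simplicity; a bounded queue with put() would cap memory use.
        BlockingQueue<List<String>> queue = new LinkedBlockingQueue<>();

        Thread reader = new Thread(() -> {
            try (BufferedReader in = fileB) {
                List<String> block = new ArrayList<>(blockSize);
                String line;
                while ((line = in.readLine()) != null) {
                    block.add(line);
                    if (block.size() == blockSize) {
                        queue.add(block);                 // hand the full block over
                        block = new ArrayList<>(blockSize);
                    }
                }
                if (!block.isEmpty()) queue.add(block);   // last, partially filled block
            } catch (IOException e) {
                System.err.println("Reading file B failed: " + e.getMessage());
            } finally {
                queue.add(END);                           // always let the searcher finish
            }
        });
        reader.start();

        List<String> matches = new ArrayList<>();
        while (true) {
            List<String> block = queue.take();            // waits until a block is ready
            if (block == END) break;
            for (String row : block) {
                if (Collections.binarySearch(sortedA, row) >= 0) matches.add(row);
            }
        }
        reader.join();
        return matches;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> a = new ArrayList<>(Arrays.asList("delta", "alpha", "charlie"));
        Collections.sort(a);
        BufferedReader b = new BufferedReader(new StringReader("alpha\nzulu\ncharlie\n"));
        System.out.println(search(a, b, 2));
    }
}
```

In a real run the BufferedReader would wrap a FileReader for file B; a StringReader is used in main only to keep the example self-contained.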

  3. You could end the inner loop after the first match (if you are allowed to).

  4. I would attempt to reduce the size of your try blocks, since large ones may prevent some JVM optimizations. Even if performance does not change much, I don't see the point of including in try blocks any instructions that cannot trigger exceptions.

  5. You should instrument your code to understand where it spends most of its time, so that you can fine-tune that part of the code.
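The last three suggestions could be combined along the lines of the sketch below. The names TimedScan, readLines and countMatches are invented for this example. It keeps the contains check from the question, limits the try block to the reading code that can actually throw, leaves the inner loop on the first match, and uses System.nanoTime() as a very basic form of instrumentation (a profiler would give a more detailed picture).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the code from the question.
class TimedScan {

    /** Reads all lines; the try block covers only the I/O that can throw. */
    static List<String> readLines(BufferedReader in) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = in) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Counts reference lines that contain at least one source line. */
    static int countMatches(List<String> reference, List<String> source) {
        int matched = 0;
        for (String ref : reference) {
            for (String src : source) {
                if (ref.contains(src)) {
                    matched++;
                    break; // first match is enough, skip the rest of source
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // StringReader stands in for FileReader to keep the example self-contained.
        List<String> reference = readLines(new BufferedReader(new StringReader("abc\nxyz\nabcabc\n")));
        List<String> source = readLines(new BufferedReader(new StringReader("bc\nab\n")));
        long afterRead = System.nanoTime();

        int matched = countMatches(reference, source);
        long afterCompare = System.nanoTime();

        System.out.printf("read: %d us, compare: %d us, matched: %d%n",
                (afterRead - start) / 1_000, (afterCompare - afterRead) / 1_000, matched);
    }
}
```

Timing the read phase and the compare phase separately, as above, is usually enough to tell whether the file I/O or the comparison loop dominates before reaching for a full profiler.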
