
Creating a simple index on a text file in Java

I need to implement a simple indexing scheme for a big text file. The file contains key-value pairs, and I need to read back a specific key-value pair without loading the complete file into memory. The file is huge, contains millions of entries, and the keys are not sorted. Different key-value pairs need to be read depending on user input, so I don't want the complete file to be read every time. Please let me know the exact classes and methods in the Java file handling API that would help implement this in a simple and efficient way. I want to do this without using an external library such as Lucene.

As the comments pointed out, you will need to do a linear search of the entire file in the worst case, and of half of it on average. Fortunately, there are some tricks you can use.

If the file doesn't change much, create a copy of the file in which the entries are sorted. Ideally, make the records in the copy the same length, so that you can go straight to the Nth entry in the sorted file (it starts at byte N times the record length), which makes a binary search over the file possible.
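A minimal sketch of the sorted, fixed-length idea, under some assumptions that are not in the original answer: keys fit in 16 ASCII bytes and values in 48, both are space-padded, and neither has leading or trailing whitespace. The class name FixedWidthLookup and its methods are hypothetical. Sorting a file that is too large for memory would need an external sort, which is not shown; here the entries arrive already sorted in a SortedMap.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;

class FixedWidthLookup {
    // Assumed record layout: 16-byte key followed by 48-byte value, space-padded ASCII.
    static final int KEY_LEN = 16;
    static final int VALUE_LEN = 48;
    static final int RECORD_LEN = KEY_LEN + VALUE_LEN;

    // Pads with trailing spaces (or truncates) to exactly len bytes.
    static byte[] pad(String s, int len) {
        byte[] out = new byte[len];
        Arrays.fill(out, (byte) ' ');
        byte[] src = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(src, 0, out, 0, Math.min(src.length, len));
        return out;
    }

    // Writes the entries in key order, one fixed-length record each.
    static void writeSorted(SortedMap<String, String> entries, String path) throws IOException {
        try (RandomAccessFile out = new RandomAccessFile(path, "rw")) {
            out.setLength(0); // start from an empty file
            for (Map.Entry<String, String> e : entries.entrySet()) {
                out.write(pad(e.getKey(), KEY_LEN));
                out.write(pad(e.getValue(), VALUE_LEN));
            }
        }
    }

    // Binary search over the records; returns the value, or null if the key is absent.
    static String find(String path, String key) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            long lo = 0;
            long hi = in.length() / RECORD_LEN - 1;
            byte[] record = new byte[RECORD_LEN];
            while (lo <= hi) {
                long mid = (lo + hi) >>> 1;
                in.seek(mid * RECORD_LEN); // jump straight to record number mid
                in.readFully(record);
                String k = new String(record, 0, KEY_LEN, StandardCharsets.US_ASCII).trim();
                int cmp = k.compareTo(key);
                if (cmp == 0) {
                    return new String(record, KEY_LEN, VALUE_LEN, StandardCharsets.US_ASCII).trim();
                }
                if (cmp < 0) {
                    lo = mid + 1;
                } else {
                    hi = mid - 1;
                }
            }
            return null;
        }
    }
}
```

Each lookup touches only about log2(N) records, so even with millions of entries a search reads a few dozen records rather than the whole file.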

If you don't have the disk space for that, create an index file that has every key in the original file as its key and the offset into the original file as its value. Again, use fixed-length records. Or better, make this index file a database. Or load the original file into a database. In either case, disk storage is very cheap.

EDIT: To create the index file, open the main file using RandomAccessFile and read it sequentially. Use the 'getFilePointer()' method at the start of each entry to read the position in the file, and store that position plus the key in the index file. When looking something up, read the file pointer from the index file and use the 'seek(long)' method to jump to that point in the original file.
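The steps in this edit could be sketched roughly as follows. The question does not give the file format, so this assumes one "key=value" entry per line in ASCII text; the class name OffsetIndexSketch, its method names, and the index layout (a UTF key followed by a long offset) are all hypothetical choices for illustration.

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

class OffsetIndexSketch {
    // Reads the data file sequentially and writes (key, offset) pairs to the index file.
    // Assumption: one "key=value" entry per line, ASCII text.
    static void buildIndex(String dataPath, String indexPath) throws IOException {
        try (RandomAccessFile data = new RandomAccessFile(dataPath, "r");
             DataOutputStream index = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(indexPath)))) {
            long offset = data.getFilePointer(); // position at the start of the entry
            String line;
            // Note: RandomAccessFile.readLine() is unbuffered and maps each byte to one char.
            while ((line = data.readLine()) != null) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    index.writeUTF(line.substring(0, eq));
                    index.writeLong(offset);
                }
                offset = data.getFilePointer(); // start of the next entry
            }
        }
    }

    // Loads the index file into memory: key -> offset into the data file.
    static Map<String, Long> loadIndex(String indexPath) throws IOException {
        Map<String, Long> offsets = new HashMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(indexPath)))) {
            while (true) {
                String key;
                try {
                    key = in.readUTF();
                } catch (EOFException e) {
                    break; // end of the index file
                }
                offsets.put(key, in.readLong());
            }
        }
        return offsets;
    }

    // Jumps to the stored offset and reads just that one entry.
    static String readEntry(String dataPath, long offset) throws IOException {
        try (RandomAccessFile data = new RandomAccessFile(dataPath, "r")) {
            data.seek(offset);
            return data.readLine();
        }
    }
}
```

The index is built once; after that, each lookup costs one map access plus one seek into the data file. Loading the whole index into a HashMap is a simplification that assumes the keys fit in memory even though the full file does not.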

I'd recommend building an index file. Scan the input file and write every key and its offset into a List, then sort the list and write it to the index file. Then, whenever you want to look up a key, read in the index file and do a binary search on the list. Once you find the key you need, open the data file as a RandomAccessFile and seek to the position of the key. Then you can read the key and the value.
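A rough sketch of the sort-and-binary-search part of this approach. It assumes one "key=value" entry per line in ASCII text, which the question does not specify, and it keeps the sorted list in memory; writing the list to an index file and reading it back is left out. The names SortedIndexSketch, Entry, scan, and lookup are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class SortedIndexSketch {
    // One index entry: a key and the byte offset of its line in the data file.
    static final class Entry {
        final String key;
        final long offset;
        Entry(String key, long offset) {
            this.key = key;
            this.offset = offset;
        }
    }

    static final Comparator<Entry> BY_KEY = Comparator.comparing((Entry e) -> e.key);

    // Scans the data file once and returns the entries sorted by key.
    // Assumption: one "key=value" entry per line, ASCII text.
    static List<Entry> scan(String dataPath) throws IOException {
        List<Entry> entries = new ArrayList<>();
        try (RandomAccessFile data = new RandomAccessFile(dataPath, "r")) {
            long offset = 0;
            String line;
            while ((line = data.readLine()) != null) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    entries.add(new Entry(line.substring(0, eq), offset));
                }
                offset = data.getFilePointer(); // start of the next line
            }
        }
        entries.sort(BY_KEY);
        return entries;
    }

    // Binary-searches the sorted list, then seeks into the data file; null if the key is absent.
    static String lookup(String dataPath, List<Entry> sorted, String key) throws IOException {
        int pos = Collections.binarySearch(sorted, new Entry(key, -1), BY_KEY);
        if (pos < 0) {
            return null;
        }
        try (RandomAccessFile data = new RandomAccessFile(dataPath, "r")) {
            data.seek(sorted.get(pos).offset);
            String line = data.readLine();
            return line.substring(line.indexOf('=') + 1);
        }
    }
}
```

If keys repeat in the data file, Collections.binarySearch may return any one of the matching entries, so this sketch is only well-defined for unique keys.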

How about using the Java Scanner?

http://docs.oracle.com/javase/tutorial/essential/io/scanning.html

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;

public class ScanXan {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes the Scanner and the underlying reader automatically
        try (Scanner s = new Scanner(new BufferedReader(new FileReader("xanadu.txt")))) {
            // Reads the file token by token (whitespace-delimited by default)
            while (s.hasNext()) {
                // Split the token and match it against your key here
                System.out.println(s.next());
            }
        }
    }
}
