
Java: ASCII random line file access with state

Is there a better [pre-existing optional Java 1.6] solution than creating a streaming file reader class that will meet the following criteria?

  • Given an ASCII file of arbitrarily large size where each line is terminated by a \n
  • For each invocation of some method readLine(), read a random line from the file
  • And for the life of the file handle, no call to readLine() should return the same line twice

Update:

  • All lines must eventually be read

Context: the file's contents are created from Unix shell commands to get a directory listing of all paths contained within a given directory; there are between millions and a billion files (which yields millions to a billion lines in the target file). If there is some way to randomly distribute the paths into the file at creation time, that is an acceptable solution as well.

In order to avoid reading in the whole file, which may not be possible in your case, you may want to use a RandomAccessFile instead of a standard Java FileInputStream. With RandomAccessFile, you can use the seek(long position) method to skip to an arbitrary place in the file and start reading there. The code would look something like this.

RandomAccessFile raf = new RandomAccessFile("path-to-file","r");
HashMap<Integer,String> sampledLines = new HashMap<Integer,String>();
for(int i = 0; i < numberOfRandomSamples; i++)
{
    //seek to a random point in the file
    raf.seek((long)(Math.random()*raf.length()));

    //skip from the random location to the beginning of the next line
    int nextByte = raf.read();
    while(nextByte != -1 && ((char)nextByte) != '\n')
        nextByte = raf.read();
    if(nextByte == -1)
        raf.seek(0);//wrap around to the beginning of the file, which is already the start of a line

    //read the line into a buffer
    StringBuilder lineBuffer = new StringBuilder();
    nextByte = raf.read();
    while(nextByte != -1 && (((char)nextByte) != '\n'))
    {
        lineBuffer.append((char)nextByte);
        nextByte = raf.read();//advance, or this loop never terminates
    }

    //ensure uniqueness (note: distinct lines with equal hashCodes would collide here)
    String line = lineBuffer.toString();
    if(sampledLines.containsKey(line.hashCode()))
        i--;//duplicate; sample again
    else
        sampledLines.put(line.hashCode(),line);
}

Here, sampledLines should hold your randomly selected lines at the end. You may need to check that you haven't randomly skipped to the end of the file as well, to avoid an error in that case.

EDIT: I made it wrap to the beginning of the file in case you reach the end. It was a pretty simple check.

EDIT 2: I made it verify uniqueness of lines by using a HashMap.

Pre-process the input file and remember the offset of each new line. Use a BitSet to keep track of used lines. If you want to save some memory, then remember the offset of every 16th line; it is still easy to jump into the file and do a sequential lookup within a block of 16 lines.
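A minimal sketch of that offset-index approach (the class and method names are illustrative, the full index is assumed to fit in memory, and the every-16th-line refinement is left out for brevity):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Random;

// Sketch: index every line offset up front, then hand out unread lines at random.
public class OffsetIndexReader
{
    private final RandomAccessFile raf;
    private final List<Long> offsets = new ArrayList<Long>();
    private final BitSet used = new BitSet();
    private final Random rnd = new Random();

    public OffsetIndexReader(String fileName) throws IOException
    {
        raf = new RandomAccessFile(fileName, "r");
        long length = raf.length();
        long pos = 0;
        offsets.add(pos);                  // first line starts at offset 0
        int b;
        while ((b = raf.read()) != -1)
        {
            pos++;
            if (b == '\n' && pos < length)
                offsets.add(pos);          // each '\n' marks the start of the next line
        }
    }

    // Returns a random not-yet-read line, or null once every line has been returned.
    public String readRandomLine() throws IOException
    {
        if (used.cardinality() == offsets.size())
            return null;
        int i;
        do {
            i = rnd.nextInt(offsets.size());
        } while (used.get(i));             // retry until an unused line is found
        used.set(i);
        raf.seek(offsets.get(i));
        return raf.readLine();
    }
}
```

To store only every 16th offset instead, you would seek to offsets.get(i / 16) and scan forward up to 15 lines; the BitSet bookkeeping stays the same.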

Since you can pad the lines, I would do something along those lines; you should also note that even then, there may be a limitation on how many elements a List can actually hold.

Using a random number each time you want to read a line and adding it to a Set would also do; however, the following ensures that the file is completely read:

import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Random;

public class VeryLargeFileReading
    implements Iterator<String>, Closeable
{
    private static final Random RND = new Random();
    // Shuffled list of all line offsets
    final List<Long> indices = new ArrayList<Long>();
    final RandomAccessFile fd;

    public VeryLargeFileReading(String fileName, long lineSize)
        throws IOException
    {
        fd = new RandomAccessFile(fileName, "r");
        long nrLines = fd.length() / lineSize;
        for (long i = 0; i < nrLines; i++)
            indices.add(i * lineSize);
        Collections.shuffle(indices, RND);
    }

    // Iterator methods
    @Override
    public boolean hasNext()
    {
        return !indices.isEmpty();
    }

    @Override
    public void remove()
    {
        // Nope
        throw new UnsupportedOperationException();
    }

    @Override
    public String next()
    {
        if (indices.isEmpty())
            throw new NoSuchElementException();
        // Removing from the end of an ArrayList is O(1); the list is shuffled anyway
        final long offset = indices.remove(indices.size() - 1);
        try {
            fd.seek(offset);
            return fd.readLine().trim();
        } catch (IOException e) {
            // Iterator.next() cannot throw a checked exception
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close() throws IOException
    {
        fd.close();
    }
}

If the number of files is truly arbitrary, it seems like there could be an associated issue with tracking processed files in terms of memory usage (or I/O time if tracking in files instead of a list or set). Solutions that keep a growing list of selected lines also run into timing-related issues.

I'd consider something along the lines of the following:

  1. Create n "bucket" files. n could be determined based on something that takes into account the number of files and system memory. (If n is large, you could generate a subset of n to keep open file handles down.)
  2. Each file's name is hashed and goes into an appropriate bucket file, "sharding" the directory based on arbitrary criteria.
  3. Read in the bucket file contents (just filenames) and process as-is (randomness provided by the hashing mechanism), or pick rnd(n) and remove as you go, providing a bit more randomness.
  4. Alternatively, you could pad and use the random-access idea, removing indices/offsets from a list as they're picked.
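Steps 1 and 2 might be sketched like this (the bucket count, the bucket file names, and the use of String.hashCode() are illustrative assumptions; all buckets are kept open at once for simplicity):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

// Sketch: shard a huge listing file into n bucket files by hashing each path.
public class BucketSharder
{
    public static void shard(String listingFile, int n) throws IOException
    {
        Writer[] buckets = new Writer[n];
        for (int i = 0; i < n; i++)
            buckets[i] = new FileWriter("bucket-" + i + ".txt");

        BufferedReader in = new BufferedReader(new FileReader(listingFile));
        String path;
        while ((path = in.readLine()) != null)
        {
            // Math.abs can overflow on Integer.MIN_VALUE, so mask the sign bit instead
            int bucket = (path.hashCode() & 0x7fffffff) % n;
            buckets[bucket].write(path);
            buckets[bucket].write('\n');
        }
        in.close();
        for (Writer w : buckets)
            w.close();
    }
}
```

Each bucket can then be shuffled or sampled independently, keeping the per-bucket state small enough to fit in memory.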
