
What's the fastest way to scan a very large file in Java?

Imagine I have a very large text file. Performance really matters.

All I want to do is scan it to look for a certain string. Maybe I want to count how many occurrences there are, but that is not really the point.

The point is: what's the fastest way?

I don't care about maintainability; it just needs to be fast.

Fast is key.

For a one-off search, use a Scanner, as suggested here:

A simple technique that could well be considerably faster than indexOf() is to use a Scanner, with the method findWithinHorizon(). If you use a constructor that takes a File object, Scanner will internally make a FileChannel to read the file. And for pattern matching it will end up using a Boyer-Moore algorithm for efficient string searching.
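A minimal sketch of the Scanner approach quoted above, assuming a UTF-8 text file and a literal (non-regex) search string. The class name ScannerSearch and both count methods are made up for illustration. A horizon of 0 tells findWithinHorizon() to search without a bound, ignoring line delimiters.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ScannerSearch {

    /** Counts non-overlapping occurrences of a literal string in the scanner's input. */
    static long count(Scanner scanner, String needle) {
        // Pattern.quote turns the needle into a literal, so regex
        // metacharacters such as '.' are matched verbatim.
        Pattern literal = Pattern.compile(Pattern.quote(needle));
        long hits = 0;
        // Horizon 0 means "no bound": keep searching until the input ends.
        // Each successful call advances the scanner past the match.
        while (scanner.findWithinHorizon(literal, 0) != null) {
            hits++;
        }
        return hits;
    }

    /** Convenience overload for a file; the UTF-8 charset is an assumption. */
    static long count(File file, String needle) throws FileNotFoundException {
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            return count(scanner, needle);
        }
    }
}
```

Calling ScannerSearch.count(new File(path), "needle") returns the number of hits. Scanner keeps unmatched input buffered while it looks for the next match, so on a very large file with few matches it is worth measuring memory use before relying on this.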

First of all, use nio (FileChannel) rather than the java.io classes. Second, use an efficient string search algorithm like Boyer-Moore.
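As a rough sketch of how those two pieces could fit together: the code below reads the file through a channel in fixed-size chunks and searches each chunk with Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that uses only the bad-character table. The class name ChannelSearch, the 64 KB buffer size, and the UTF-8 encoding of the search string are assumptions, not anything from the answer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical helper, not part of any library.
final class ChannelSearch {

    /** Counts all (possibly overlapping) occurrences of needle in the channel's bytes. */
    static long count(ReadableByteChannel channel, byte[] needle, int bufferSize)
            throws IOException {
        int m = needle.length;
        if (m == 0 || bufferSize < m) {
            throw new IllegalArgumentException("need 0 < needle.length <= bufferSize");
        }
        // Horspool table: how far the window may slide, keyed by the
        // byte currently aligned with the last needle position.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[needle[i] & 0xFF] = m - 1 - i;
        }

        ByteBuffer buffer = ByteBuffer.allocate(bufferSize); // heap buffer, array-backed
        byte[] data = buffer.array();
        long hits = 0;
        while (channel.read(buffer) != -1) {
            int end = buffer.position(); // valid bytes are data[0..end)
            int pos = 0;
            while (pos + m <= end) {
                int j = m - 1;
                while (j >= 0 && data[pos + j] == needle[j]) {
                    j--; // compare right to left
                }
                if (j < 0) {
                    hits++;
                }
                pos += shift[data[pos + m - 1] & 0xFF];
            }
            // Carry the unscanned tail (fewer than m bytes) to the front so a
            // match that straddles two chunks is still found.
            int tail = end - pos;
            System.arraycopy(data, pos, data, 0, tail);
            buffer.position(tail);
        }
        return hits;
    }

    /** Convenience overload for a file; buffer size and charset are assumptions. */
    static long count(Path file, String needle) throws IOException {
        byte[] bytes = needle.getBytes(StandardCharsets.UTF_8);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return count(channel, bytes, Math.max(1 << 16, bytes.length));
        }
    }
}
```

Searching raw bytes avoids decoding the whole file into chars, which is often a large part of the cost. It only works directly when the search string has an unambiguous byte encoding in the file, as it does for UTF-8 or ASCII.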

If you need to search through the same file multiple times for different strings, you'll want to construct some kind of index, so take a look at Lucene.

Load the whole file into memory and then look at using a string searching algorithm such as Knuth-Morris-Pratt.
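The load-then-search idea could be sketched like this. It is a textbook Knuth-Morris-Pratt over raw bytes; the class name KmpSearch and the UTF-8 encoding of the search string are assumptions. Files.readAllBytes() returns a single array, so this only works for files below roughly 2 GB that also fit in the heap.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any library.
final class KmpSearch {

    /** fail[i] = length of the longest proper prefix of p[0..i] that is also its suffix. */
    static int[] failureTable(byte[] p) {
        int[] fail = new int[p.length];
        int k = 0;
        for (int i = 1; i < p.length; i++) {
            while (k > 0 && p[i] != p[k]) {
                k = fail[k - 1];
            }
            if (p[i] == p[k]) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    /** Counts all (possibly overlapping) occurrences of pattern in text. */
    static int count(byte[] text, byte[] pattern) {
        if (pattern.length == 0) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        int[] fail = failureTable(pattern);
        int hits = 0;
        int k = 0; // number of pattern bytes matched so far
        for (int i = 0; i < text.length; i++) {
            while (k > 0 && text[i] != pattern[k]) {
                k = fail[k - 1]; // fall back without re-reading text
            }
            if (text[i] == pattern[k]) {
                k++;
            }
            if (k == pattern.length) {
                hits++;
                k = fail[k - 1]; // continue, allowing overlapping matches
            }
        }
        return hits;
    }

    /** Loads the whole file into memory, then searches it. */
    static int count(Path file, String needle) throws IOException {
        byte[] text = Files.readAllBytes(file);
        return count(text, needle.getBytes(StandardCharsets.UTF_8));
    }
}
```

KMP never moves backwards in the text, so the same matching loop would also work on a stream if holding the whole file in memory turns out to be impractical.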

Edit:
A quick Google search turns up this string searching library, which seems to implement a few different string search algorithms. Note that I've never used it, so I can't vouch for it.

Whatever the specifics may be, memory-mapped IO is usually the answer.
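For reference, a memory-mapped scan might look roughly like the sketch below. The class name MappedSearch is invented, the search string is assumed to be UTF-8, and the matching loop is deliberately a naive byte comparison to keep the focus on the mapping itself. A single mapping cannot exceed Integer.MAX_VALUE bytes, so a bigger file would need several overlapping windows, which this sketch does not attempt.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of any library.
final class MappedSearch {

    /** Counts all (possibly overlapping) occurrences of needle between position and limit. */
    static long count(ByteBuffer buf, byte[] needle) {
        int m = needle.length;
        if (m == 0) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        long hits = 0;
        int last = buf.limit() - m; // last index where a match can start
        for (int pos = buf.position(); pos <= last; pos++) {
            int j = 0;
            // Absolute get(int) reads without moving the buffer's position.
            while (j < m && buf.get(pos + j) == needle[j]) {
                j++;
            }
            if (j == m) {
                hits++;
            }
        }
        return hits;
    }

    /** Maps the file read-only and scans it; only handles files up to 2 GB - 1. */
    static long count(Path file, String needle) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            if (size > Integer.MAX_VALUE) {
                throw new IOException("file too large for a single mapping");
            }
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            return count(map, needle.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

With a mapping, the operating system pages the file in on demand, so the Java side does no explicit read calls or buffer copies. Whether that beats chunked channel reads depends on the file size and the platform, so it is worth benchmarking both.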

Edit: depending on your requirements, you could try importing the file into an SQL database and then querying it through JDBC, letting the database do the heavy lifting.

Edit2: this thread at JavaRanch has some other ideas, involving FileChannel. I think it might be exactly what you are looking for.

I'd say the fastest you can get is to use a BufferedInputStream on top of a FileInputStream... or use custom buffers if you want to avoid the BufferedInputStream instantiation.

This explains it better than I can: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/

Use the right tool: a full-text search library

My suggestion is to build an in-memory index (or a file-based index with caching enabled) and then perform the search on it. As @Michael Borgwardt suggested, Lucene is the best library out there.

I don't know if this is a stupid suggestion, but isn't grep a pretty efficient file searching tool? Maybe you can call it using Runtime.getRuntime().exec(..)

It depends on whether you need to do more than one search per file. If you need to do just one search, read the file in from disk and parse it using the tools suggested by Michael Borgwardt. If you need to do more than one search, you should probably build an index of the file with a tool like Lucene: read the file in, tokenise it, and put the tokens in the index. If the index is small enough, keep it in RAM (Lucene offers both RAM and disk-backed indexes). If not, keep it on disk. And if it is too large for RAM and you are very, very, very concerned about speed, store your index on a solid state/flash drive.

