
Parsing Large Text Files in Real-time (Java)

I'm interested in parsing a fairly large text file in Java (1.6.x) and was wondering what approach(es) would be considered best practice?

The file will probably be about 1Mb in size, and will consist of thousands of entries along the lines of:

Entry
{
    property1=value1
    property2=value2
    ...
}

etc.

My first instinct is to use regular expressions, but I have no prior experience of using Java in a production environment, and so am unsure how powerful the java.util.regex classes are.

To clarify a bit, my application is going to be a web app (JSP) which parses the file in question and displays the various values it retrieves. There is only ever the one file which gets parsed (it resides in a 3rd party directory on the host).

The app will have a fairly low usage (maybe only a handful of users using it a couple of times a day), but it is vital that when they do use it, the information is retrieved as quickly as possible.

Also, are there any precautions to take around loading the file into memory every time it is parsed?

Can anyone recommend an approach to take here?

Thanks

If it's going to be about 1MB and literally in the format you state, then it sounds like you're overengineering things.

Unless your server is a ZX Spectrum or something, just use regular expressions to parse it, whack the data in a hash map (and keep it there), and don't worry about it. It'll take up a few megabytes in memory, but so what...?
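A minimal sketch of what this answer suggests, assuming the exact format from the question (the class and method names here are illustrative, not from the answer itself):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleEntryParser {

    // Matches lines like "property1=value1"; lines such as "Entry", "{"
    // and "}" contain no '=' and are simply skipped.
    private static final Pattern PROPERTY = Pattern.compile("^\\s*(\\w+)=(.*)$");

    public static Map<String, String> parse(String text) {
        Map<String, String> properties = new HashMap<String, String>();
        for (String line : text.split("\n")) {
            Matcher m = PROPERTY.matcher(line);
            if (m.matches()) {
                properties.put(m.group(1), m.group(2).trim());
            }
        }
        return properties;
    }

    public static void main(String[] args) {
        String sample = "Entry\n{\n    property1=value1\n    property2=value2\n}\n";
        Map<String, String> props = parse(sample);
        System.out.println(props.get("property1")); // value1
    }
}
```

Parse once, keep the resulting map in memory (e.g. in application scope for a JSP app), and the per-request cost is just a map lookup.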

Update: just to give you a concrete idea of performance, some measurements I took of the performance of String.split() (which uses regular expressions) show that on a 2GHz machine, it takes milliseconds to split 10,000 100-character strings (in other words, about 1 megabyte of data -- actually nearer 2MB in pure volume of bytes, since Strings are 2 bytes per char). Obviously, that's not quite the operation you're performing, but you get my point: things aren't that bad...

If it is a proper grammar, use a parser builder such as the GOLD Parsing System. This allows you to specify the format and use an efficient parser to get the tokens you need, getting error handling almost for free.

I'm wondering why this isn't in XML, and then you could leverage the available XML tooling. I'm thinking particularly of SAX, in which case you could easily parse/process this without holding it all in memory.

So can you convert this to XML?

If you can't, and you need a parser, then take a look at JavaCC.

Use the Scanner class and process your file a line at a time. I'm not sure why you mentioned regex. Regex is almost never the right answer to any parsing question because of the ambiguity and lack of semantic control over what's happening in what context.
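A sketch of the line-at-a-time Scanner approach this answer describes (the class name and the flattening of all entries into one map are my own simplifications):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class ScannerParser {

    // Reads the file one line at a time; any "key=value" line is recorded,
    // structural lines ("Entry", "{", "}") are ignored.
    public static Map<String, String> parseFile(File file) throws FileNotFoundException {
        Map<String, String> properties = new HashMap<String, String>();
        Scanner scanner = new Scanner(file);
        try {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine().trim();
                int eq = line.indexOf('=');
                if (eq > 0) {
                    properties.put(line.substring(0, eq), line.substring(eq + 1));
                }
            }
        } finally {
            scanner.close();
        }
        return properties;
    }
}
```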

You could use the Antlr parser generator to build a parser capable of parsing the file.

Not answering the question about parsing ... but you could parse the files and generate static pages as soon as new files arrive. So you would have no performance problems... (And I think 1Mb isn't a big file so you can load it in memory, as long as you don't load too many files concurrently...)

This seems like a simple enough file format, so you may consider using a Recursive Descent Parser. Compared to JavaCC and Antlr, its pros are that you can write a few simple methods, get the data you need, and you do not need to learn a parser generator formalism. Its cons: it may be less efficient. A recursive descent parser is in principle more powerful than regular expressions. If you can come up with a grammar for this file type, it will serve you for whatever solution you choose.
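A hand-written recursive descent parser for the format in the question might look something like this (a sketch under the assumption that entries always follow the `Entry { key=value ... }` shape; the grammar comments and names are my own):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Recursive descent parser: one method per grammar rule,
// consuming the input line by line.
public class EntryDescentParser {
    private final String[] lines;
    private int pos = 0;

    public EntryDescentParser(String text) {
        this.lines = text.split("\n");
    }

    // file := entry*
    public List<Map<String, String>> parseFile() {
        List<Map<String, String>> entries = new ArrayList<Map<String, String>>();
        while (pos < lines.length) {
            if (lines[pos].trim().equals("Entry")) {
                entries.add(parseEntry());
            } else {
                pos++; // skip blank/unrecognized lines
            }
        }
        return entries;
    }

    // entry := "Entry" "{" (key "=" value)* "}"
    private Map<String, String> parseEntry() {
        pos++; // consume "Entry"
        expect("{");
        Map<String, String> props = new HashMap<String, String>();
        while (pos < lines.length && !lines[pos].trim().equals("}")) {
            String line = lines[pos].trim();
            int eq = line.indexOf('=');
            if (eq > 0) {
                props.put(line.substring(0, eq), line.substring(eq + 1));
            }
            pos++;
        }
        expect("}");
        return props;
    }

    // Error handling is where this approach earns its keep:
    // you know exactly which line violated which rule.
    private void expect(String token) {
        if (pos >= lines.length || !lines[pos].trim().equals(token)) {
            throw new IllegalStateException(
                "Expected '" + token + "' at line " + (pos + 1));
        }
        pos++;
    }
}
```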

If it's the limitations of Java regexes you're wondering about, don't worry about it. Assuming you're reasonably competent at crafting regexes, performance shouldn't be a problem. The feature set is satisfyingly rich, too -- including my favorite, possessive quantifiers.

The other solution is to do some form of preprocessing (done offline, or as a cron job) which produces a very optimized data structure, which is then used to serve the many web requests (without having to reparse the file).
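A middle ground between reparsing on every request and offline preprocessing is to cache the parsed result and reparse only when the file's timestamp changes. This is a sketch, not from any of the answers; the `parse` method is a placeholder for whichever parsing approach you pick:

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Caches the parsed data; reparses only when the file changes on disk.
public class CachedFileParser {
    private final File file;
    private long lastModified = -1;
    private Map<String, String> cached;

    public CachedFileParser(File file) {
        this.file = file;
    }

    public synchronized Map<String, String> getData() {
        long modified = file.lastModified();
        if (cached == null || modified != lastModified) {
            cached = parse(file);
            lastModified = modified;
        }
        return cached;
    }

    // Placeholder: plug in your parsing routine of choice here.
    private Map<String, String> parse(File f) {
        return new HashMap<String, String>();
    }
}
```

With a handful of users a day, the cache almost always hits, so requests see only a map lookup plus one `lastModified()` call.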

Though, looking at the scenario in question, that doesn't seem to be needed.
