Text File Parsing in Java

I am reading a text file with FileInputStream, which puts the file contents into a byte array. I then convert the byte array into a String using new String(byte).

Once I have the string, I use String.split("\\n") to split the file into a String array, then parse each element of that array with String.split(",") and hold the contents in an ArrayList.

I have a 200MB+ file, and the program runs out of memory even when I start the JVM with 1GB of memory. I know I must be doing something incorrectly somewhere; I'm just not sure whether the problem is the way I'm parsing or the data structure I'm using.

It also takes about 12 seconds to parse the file, which seems like a lot of time. Can anyone point out what I may be doing that causes me to run out of memory, and what may be causing my program to run slowly?

The contents of the file look as shown below:

"12334", "100", "1.233", "TEST", "TEXT", "1234"
"12334", "100", "1.233", "TEST", "TEXT", "1234"
.
.
.
"12334", "100", "1.233", "TEST", "TEXT", "1234"

Thanks

I'm not sure how efficient it is memory-wise, but my first approach would be to use a Scanner, as it is incredibly easy to use:

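import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

File file = new File("/path/to/my/file.txt");
// try-with-resources closes the Scanner even if processing throws
try (Scanner input = new Scanner(file)) {
    // Process line by line. To process token by token instead,
    // loop on input.hasNext() and call input.next().
    while (input.hasNextLine()) {
        String nextLine = input.nextLine();
        // process nextLine here
    }
}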

Check the API for how to alter the delimiter it uses to split tokens.
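As a rough sketch of what changing the delimiter could look like: Scanner.useDelimiter takes a regular expression, so a comma with optional surrounding whitespace can serve as the token separator. The class name DelimiterSketch and the helper fields are made up for illustration, and the quote stripping assumes values never contain embedded quotes or commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterSketch {
    // Hypothetical helper: splits one line of the sample format into fields
    // using a Scanner whose delimiter is a comma plus surrounding whitespace.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(line)) {
            sc.useDelimiter("\\s*,\\s*");
            while (sc.hasNext()) {
                // Drop the double quotes that wrap each value
                // (assumes no quotes inside the values themselves).
                out.add(sc.next().replace("\"", ""));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [12334, 100, 1.233]
        System.out.println(fields("\"12334\", \"100\", \"1.233\""));
    }
}
```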

It sounds to me like you're doing something wrong: there's a whole lot of object creation going on.

How representative is that "test" file? What are you really doing with that data? If that's typical of what you really have, I'd say there's lots of repetition in that data.

If it's all going to be in Strings anyway, start with a BufferedReader to read each line. Pre-allocate that List to a size that's close to what you need so you don't waste resources growing it as you add. Split each of those lines at the comma; be sure to strip off the double quotes.
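A minimal sketch of that advice, under some assumptions: the class LineParserSketch, the method parse and the expectedRows parameter are invented names, and the plain split(",") assumes no field contains a comma inside its quotes. For a real file you would wrap a FileReader or InputStreamReader instead of the StringReader used in main.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineParserSketch {
    // Reads rows one line at a time. expectedRows is a caller-supplied guess
    // used only to pre-size the list so it does not have to grow repeatedly.
    static List<String[]> parse(BufferedReader reader, int expectedRows) {
        List<String[]> rows = new ArrayList<>(expectedRows);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) continue; // skip blank lines
                String[] fields = line.split(",");
                for (int i = 0; i < fields.length; i++) {
                    // Remove padding and the surrounding double quotes.
                    fields[i] = fields[i].trim().replace("\"", "");
                }
                rows.add(fields);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String sample = "\"12334\", \"100\", \"1.233\", \"TEST\", \"TEXT\", \"1234\"\n";
        try (BufferedReader r = new BufferedReader(new StringReader(sample))) {
            // prints 12334|100|1.233|TEST|TEXT|1234
            System.out.println(String.join("|", parse(r, 1).get(0)));
        }
    }
}
```

Note that this still keeps every row in memory, so it reduces the overhead but does not remove it.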

You might want to ask yourself: "Why do I need this whole file in memory all at once?" Can you read a little, process a little, and never have the whole thing in memory at once? Only you know your problem well enough to answer.
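To illustrate the "read a little, process a little" idea, here is a hedged sketch that totals one column while holding only the current line in memory. Summing the third column is purely an invented example of "processing" (the question does not say what is done with the data), and StreamingSketch and sumThirdColumn are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class StreamingSketch {
    // Sums the numeric third column. Each line becomes garbage as soon as
    // it has been processed, so memory use does not grow with file size.
    static double sumThirdColumn(BufferedReader reader) {
        double total = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                if (fields.length < 3) continue; // skip blank or short lines
                total += Double.parseDouble(fields[2].trim().replace("\"", ""));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        String sample = "\"1\", \"100\", \"1.5\"\n\"2\", \"100\", \"2.5\"\n";
        // prints 4.0
        System.out.println(sumThirdColumn(new BufferedReader(new StringReader(sample))));
    }
}
```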

Maybe you can fire up jvisualvm if you have JDK 6 and see what's going on with memory. That would be a great clue.

Have a look at these pages. They contain many open source CSV parsers. JSaPar is one of them.

It sounds like you currently have 3 copies of the entire file in memory: the byte array, the string, and the array of lines.

Instead of reading the bytes into a byte array and then converting to characters using new String(), it would be better to use an InputStreamReader, which converts to characters incrementally rather than all up front.

Also, instead of using String.split("\\n") to get the individual lines, you should read one line at a time. You can use the readLine() method in BufferedReader.

Try something like this:

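import java.io.BufferedReader;
import java.io.InputStreamReader;

// fileInputStream is the FileInputStream you already open for the file
BufferedReader reader = new BufferedReader(new InputStreamReader(fileInputStream, "UTF-8"));
try {
  while (true) {
    String line = reader.readLine();
    if (line == null) break;  // readLine() returns null at end of file
    String[] fields = line.split(",");
    // process fields here
  }
} finally {
  reader.close();  // also closes the underlying stream
}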

If you have a 200,000,000-character file and split it every five characters, you have 40,000,000 String objects. Assume they share the actual character data with the original 400 MB String (a char is 2 bytes). A String is, say, 32 bytes, so that is 1,280,000,000 bytes of String objects.
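The same back-of-the-envelope arithmetic, written out so the inputs can be varied. The 32 bytes per String and the five characters per token are the rough assumptions stated above, not measured values, and MemoryEstimateSketch is a made-up name.

```java
class MemoryEstimateSketch {
    // Bytes taken by the String objects alone, excluding character data.
    static long stringObjectBytes(long fileChars, long charsPerToken, long bytesPerString) {
        return (fileChars / charsPerToken) * bytesPerString;
    }

    // Bytes taken by the character data at 2 bytes per char.
    static long charDataBytes(long fileChars) {
        return fileChars * 2;
    }

    public static void main(String[] args) {
        long objects = stringObjectBytes(200_000_000L, 5, 32);
        long chars = charDataBytes(200_000_000L);
        System.out.println(objects);         // 1280000000
        System.out.println(chars);           // 400000000
        System.out.println(objects + chars); // 1680000000, above a 1 GB heap
    }
}
```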

(It's probably worth noting that this is very implementation dependent. split could create entirely new strings with entirely new backing char[] arrays or, on the other hand, share some common String values. Some Java implementations do not use slicing of char[]. Some may use a UTF-8-like compact form and give very poor random access times.)

Even assuming longer strings, that's a lot of objects. With that much data, you probably want to work with most of it in a compact form like the original (only with indexes). Convert to objects only what you need. The implementation should be database-like (although databases traditionally don't handle variable-length strings efficiently).
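One possible reading of "compact form with indexes" is sketched below: keep the raw bytes as they are, record only the offset where each line starts, and decode a line into a String when it is actually needed. This is an illustration under assumptions, not the answer's own design: LineIndexSketch is a hypothetical class, the data is assumed to be UTF-8 (or ASCII) with \n or \r\n line endings, and the file is assumed to fit in a single byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LineIndexSketch {
    private final byte[] data;     // raw file contents, kept undecoded
    private final int[] lineStart; // offset of the first byte of each line
    private final int lineCount;

    LineIndexSketch(byte[] data) {
        this.data = data;
        int[] starts = new int[16];
        int count = 0;
        int pos = 0;
        while (pos < data.length) {
            if (count == starts.length) starts = Arrays.copyOf(starts, count * 2);
            starts[count++] = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline
        }
        this.lineStart = starts;
        this.lineCount = count;
    }

    int size() {
        return lineCount;
    }

    // Decodes a single line on demand; no String exists for it until then.
    String line(int i) {
        if (i < 0 || i >= lineCount) throw new IndexOutOfBoundsException("line " + i);
        int from = lineStart[i];
        int to = (i + 1 < lineCount) ? lineStart[i + 1] : data.length;
        // Drop the trailing line terminator, if any.
        while (to > from && (data[to - 1] == '\n' || data[to - 1] == '\r')) to--;
        return new String(data, from, to - from, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = "\"1\", \"a\"\n\"2\", \"b\"\n".getBytes(StandardCharsets.UTF_8);
        LineIndexSketch index = new LineIndexSketch(bytes);
        System.out.println(index.size());  // 2
        System.out.println(index.line(1)); // "2", "b"
    }
}
```

The per-line cost here is one int (4 bytes) instead of a String object plus its character data, at the price of re-decoding a line each time it is read.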

When invoking your program you can use this command: java [-options] className [args...]
In place of [-options], provide more memory, e.g. -Xmx1024m or more. But this is just a workaround; you have to change your parsing mechanism.
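If you want to confirm that the -Xmx setting actually took effect, a small check like the following can help. Runtime.maxMemory() reports the maximum heap the JVM will try to use; the class name HeapCheckSketch and the helper maxHeapMb are made up for this example.

```java
class HeapCheckSketch {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        // Run with e.g. java -Xmx1024m HeapCheckSketch and compare the output
        // with the value you passed; the reported figure may be slightly lower.
        System.out.println("Max heap (MB): " + maxHeapMb());
    }
}
```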
