
Read/write a large file in Java

I have a binary file with the following format:

[N bytes identifier & record length] [n1 bytes data] 
[N bytes identifier & record length] [n2 bytes data] 
[N bytes identifier & record length] [n3 bytes data]

As you can see, the records have different lengths. Each record starts with a fixed N-byte header that contains an id and the length of the record's data.

The file is very big and can contain 3 million records.

I want to open this file in an application and let the user browse and edit the records (insert / update / delete).

My initial plan is to create an index file from the original file and, for each record, keep the addresses of the next and previous records so that I can navigate forward and backward easily (a sort of linked list, but in a file rather than in memory).

  • Is there a Java library that would help me implement this requirement?

  • Any recommendations or experience that you think would be useful?

----------------- EDIT ----------------------------------------------

Thanks for the guidance and suggestions.

Some more info:

The original file and its format are out of my control (it's a third-party file), and I can't change the file format. I have to read it, let the user navigate over the records and edit some of them (insert a new record / update an existing record / delete a record), and at the end save it back in the original file format.

Do you still recommend a database instead of a plain index file?

----------------- SECOND EDIT ----------------------------------------------

The record size in update mode is fixed. That is, an updated (edited) record has the same length as the original record, unless the user deletes the record and creates another record with a different format.

Many thanks.

Seriously, you should NOT be using a binary file for this. You should use a database.

The problems with trying to implement this as a regular file stem from the fact that operating systems do not allow you to insert extra bytes into the middle of an existing file. So if you need to insert a record (anywhere but at the end), update a record (with a different size), or remove a record, you would need to:

  • rewrite the other records (after the insertion/update/deletion point) to make or reclaim space, or
  • implement some kind of free-space management within the file.

All of this is complicated and/or expensive.

Fortunately, there is a class of software that implements this kind of thing. It is called database software. There is a wide range of options, from a full-scale RDBMS to lightweight solutions like BerkeleyDB files.


In response to your 1st and 2nd edits, a database will still be simpler.

However, here's an alternative that might perform better for this use case than a DB, without doing complicated free-space management.

  1. Read the file and build an in-memory index that maps ids to file locations.

  2. Create a second file to hold new and updated records.

  3. Perform the record adds/updates/deletes:

    1. An addition is handled by writing the new record to the end of the second file and adding an index entry for it.

    2. An update is handled by writing the updated record to the end of the second file and changing the existing index entry to point to it.

    3. A delete is handled by deleting the index entry for the record's key.

  4. Compact the file as follows:

    1. Create a new file.

    2. Read each record in the old file in order, and check the index for the record's key. If the entry still points to the location of the record, copy the record to the new file. Otherwise skip it.

    3. Repeat step 4.2 for the second file.

  5. If all of the above completed successfully, delete the old file and the second file.

Note that this relies on being able to keep the index in memory. If that is not feasible, then the implementation is going to be more complicated ... and more like a database.
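The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a finished implementation: the question does not say how the N-byte header is laid out, so the sketch assumes a 4-byte big-endian id followed by a 4-byte data length, and it assumes ids are unique. The class name `OverlayStore` and all of its methods are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

/** Sketch of the "second file + in-memory index + compaction" approach. */
final class OverlayStore implements AutoCloseable {
    // ASSUMPTION: header = 4-byte id + 4-byte data length (big-endian).
    private static final int HEADER_BYTES = 8;

    /** Where the live copy of a record is: which file, and at what offset. */
    private static final class Location {
        final boolean inOverlay;
        final long offset;
        Location(boolean inOverlay, long offset) {
            this.inOverlay = inOverlay;
            this.offset = offset;
        }
    }

    private final RandomAccessFile original;
    private final RandomAccessFile overlay;
    private final Map<Integer, Location> index = new HashMap<>();

    OverlayStore(String originalPath, String overlayPath) throws IOException {
        original = new RandomAccessFile(originalPath, "r");
        overlay = new RandomAccessFile(overlayPath, "rw");
        overlay.setLength(0); // step 2: start with an empty second file
        // Step 1: scan the headers and map each id to its file location.
        long offset = 0;
        long end = original.length();
        while (offset < end) {
            original.seek(offset);
            int id = original.readInt();
            int length = original.readInt();
            if (length < 0) throw new IOException("Corrupt record at " + offset);
            index.put(id, new Location(false, offset));
            offset += HEADER_BYTES + length;
        }
    }

    /** Steps 3.1 and 3.2: append to the second file and repoint the index. */
    void put(int id, byte[] data) throws IOException {
        long offset = overlay.length();
        overlay.seek(offset);
        overlay.writeInt(id);
        overlay.writeInt(data.length);
        overlay.write(data);
        index.put(id, new Location(true, offset));
    }

    /** Step 3.3: a delete only removes the index entry. */
    void delete(int id) {
        index.remove(id);
    }

    /** Returns the live data for an id, or null if it was deleted or never existed. */
    byte[] get(int id) throws IOException {
        Location loc = index.get(id);
        if (loc == null) return null;
        RandomAccessFile file = loc.inOverlay ? overlay : original;
        file.seek(loc.offset + 4); // skip the id, land on the length field
        byte[] data = new byte[file.readInt()];
        file.readFully(data);
        return data;
    }

    /** Step 4: copy live records from the old file, then from the second file. */
    void compactTo(String targetPath) throws IOException {
        try (RandomAccessFile target = new RandomAccessFile(targetPath, "rw")) {
            target.setLength(0);
            copyLive(original, false, target);
            copyLive(overlay, true, target);
        }
    }

    private void copyLive(RandomAccessFile source, boolean isOverlay,
                          RandomAccessFile target) throws IOException {
        long offset = 0;
        long end = source.length();
        while (offset < end) {
            source.seek(offset);
            int id = source.readInt();
            byte[] data = new byte[source.readInt()];
            source.readFully(data);
            Location loc = index.get(id);
            // Copy only if the index still points at exactly this copy.
            if (loc != null && loc.inOverlay == isOverlay && loc.offset == offset) {
                target.writeInt(id);
                target.writeInt(data.length);
                target.write(data);
            }
            offset += HEADER_BYTES + data.length;
        }
    }

    @Override
    public void close() throws IOException {
        original.close();
        overlay.close();
    }
}
```

Step 5 (deleting the old file and the second file once compaction has succeeded) is left to the caller. One consequence of copying the second file after the old one is that updated records end up after the untouched records in the compacted file; if the third-party format depends on record order, the compaction would need to be adapted.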

Having a data file and an index file would be the general base idea for such an implementation, but you would find yourself dealing with data fragmentation after repeated updates and deletions, among other things. This kind of work should be a separate project in itself and should not be part of your main application. Essentially, a database is what you need: it is specifically designed for such operations and use cases, and it will also allow you to search, sort, and extend (alter) your data structure without having to refactor an in-house (custom) solution.

May I suggest that you download Apache Derby and create a local embedded database (Derby does this for you when you create a new embedded connection at run time). It will likely be faster than anything you would write yourself, and it will make your application easier to maintain.

Apache Derby is a single jar file that you can simply include and distribute with your project (check the license in case any legal issue applies to your app). There is no need for a database server or third-party software; it's all pure Java.

The bottom line is that it all depends on how large your application is, whether you need to share the data across many clients, whether speed is a critical aspect of your app, and so on.

For a stand-alone, single-user project, I recommend Apache Derby. For an n-tier application, you might want to look into MySQL, PostgreSQL or (hrm) even Oracle. Using ready-made and tested solutions is not only smart, but will also cut down your development time (and maintenance effort).

Cheers.

Generally you are better off letting a library or database do the work for you.

You may not want an SQL database, and there are plenty of simple databases that don't use SQL. http://nosql-database.org/ lists 122 of them.

At a minimum, if you are going to write this yourself, I suggest you read the source of one of these databases to see how they work.


Depending on the size of the records, 3 million isn't that many, and I would suggest you keep as much in memory as possible.

The first problem you are likely to have is ensuring the data is consistent and recovering it when corruption occurs. The second problem is dealing with fragmentation efficiently (something the brightest minds working on the GC deal with). The third problem is likely to be maintaining the index transactionally with the source data, to ensure there are no inconsistencies.

While this may appear simple at first, there are significant complexities in making sure the data is reliable, maintainable, and can be accessed efficiently. This is why most developers use an existing database/datastore library and concentrate on the features that are unique to their application.

(Note: My answer is about the problem in general. It does not consider any Java libraries or, as the other answers proposed, using a database (library), which might be better than reinventing the wheel.)

The idea of creating an index is good and will be very helpful performance-wise (although you wrote "index file", I think it should be kept in memory). Generating the index should be quite fast if you read the id and record length for each entry and then just skip the data with a file seek.
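A minimal sketch of that header scan, in case it helps. The header layout is an assumption (the question only says "N bytes"): here it is taken to be a 4-byte big-endian id followed by a 4-byte data length. The names `RecordIndexer`, `Entry`, and `buildIndex` are made up for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Builds an in-memory index by reading headers only and seeking past the data. */
final class RecordIndexer {
    // ASSUMPTION: header = 4-byte id + 4-byte data length (big-endian).
    static final int HEADER_BYTES = 8;

    /** One index entry: record id, offset of its header, and length of its data. */
    static final class Entry {
        final int id;
        final long offset;
        final int dataLength;
        Entry(int id, long offset, int dataLength) {
            this.id = id;
            this.offset = offset;
            this.dataLength = dataLength;
        }
    }

    static List<Entry> buildIndex(String path) throws IOException {
        List<Entry> index = new ArrayList<>();
        byte[] header = new byte[HEADER_BYTES];
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long fileLength = file.length();
            long offset = 0;
            while (offset < fileLength) {
                file.seek(offset);
                file.readFully(header); // EOFException if the header is truncated
                ByteBuffer buf = ByteBuffer.wrap(header); // big-endian by default
                int id = buf.getInt();
                int dataLength = buf.getInt();
                if (dataLength < 0 || offset + HEADER_BYTES + dataLength > fileLength) {
                    throw new IOException("Corrupt or truncated record at offset " + offset);
                }
                index.add(new Entry(id, offset, dataLength));
                // Skip the payload without reading it.
                offset += HEADER_BYTES + dataLength;
            }
        }
        return index;
    }
}
```

Because the entries are stored in file order in a list, the "next" and "previous" record of entry `i` are simply entries `i + 1` and `i - 1`, so no explicit linked-list pointers are needed for navigation.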

You should also think about the edit functionality. Inserting and deleting in particular can be very slow on such a big file if you do it wrong (e.g. deleting and then moving all the following entries to close the gap).

The best option would be to only mark deleted entries as deleted. When inserting, you can overwrite one of those or append to the end of the file.

Insert / Update / Delete records

Inserting (rather than merely appending) and deleting records in a file is expensive because you have to move all the following content of the file to create space for the new record or to remove the space the old one used. Updating is similarly expensive if the update changes the length of the record (you say they are variable length).
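By contrast, an update that keeps the record length unchanged (the case described in the question's second edit) can be written over the old bytes without moving anything. A small sketch, assuming a header of a 4-byte big-endian id followed by a 4-byte data length (the real layout is not given in the question); `InPlaceUpdater` and `overwrite` are hypothetical names.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Overwrites a record's data in place when the new data has the same length. */
final class InPlaceUpdater {
    // ASSUMPTION: header = 4-byte id + 4-byte data length (big-endian).
    static final int ID_BYTES = 4;

    static void overwrite(String path, long recordOffset, byte[] newData) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            file.seek(recordOffset + ID_BYTES);
            int existingLength = file.readInt();
            if (existingLength != newData.length) {
                // A different length would require shifting the rest of the file.
                throw new IllegalArgumentException(
                        "Length mismatch: record has " + existingLength
                        + " bytes, new data has " + newData.length);
            }
            // The file pointer now sits at the first data byte.
            file.write(newData);
        }
    }
}
```

The record offset would come from whatever index is being maintained; the length check is there so that a mismatched update fails loudly instead of corrupting the next record's header.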

The file format you propose is fundamentally unsuitable for the kinds of operations you want to perform. Others have suggested using a database. If you don't want to go that far, adding an index file (as you suggest) is the way to go. I recommend making the index records all the same length.
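The benefit of same-length index records is that entry number k always lives at byte offset k times the entry size, so it can be reached with a single seek. A rough sketch, assuming each index entry stores a 4-byte record id, an 8-byte offset into the data file, and a 4-byte data length; this layout and the name `FixedIndexFile` are illustrative choices, not anything prescribed by the question.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Index file made of fixed-length entries, addressable by entry number. */
final class FixedIndexFile implements AutoCloseable {
    // ASSUMED entry layout: 4-byte id + 8-byte data-file offset + 4-byte data length.
    static final int ENTRY_BYTES = 16;

    private final RandomAccessFile file;

    FixedIndexFile(String path) throws IOException {
        file = new RandomAccessFile(path, "rw");
    }

    /** Number of entries currently stored. */
    long count() throws IOException {
        return file.length() / ENTRY_BYTES;
    }

    /** Appends one entry at the end of the index file. */
    void append(int id, long dataOffset, int dataLength) throws IOException {
        file.seek(file.length());
        file.writeInt(id);
        file.writeLong(dataOffset);
        file.writeInt(dataLength);
    }

    int idAt(long position) throws IOException {
        file.seek(entryStart(position));
        return file.readInt();
    }

    long offsetAt(long position) throws IOException {
        file.seek(entryStart(position) + 4);
        return file.readLong();
    }

    int lengthAt(long position) throws IOException {
        file.seek(entryStart(position) + 12);
        return file.readInt();
    }

    /** Entry k starts at k * ENTRY_BYTES because every entry has the same size. */
    private long entryStart(long position) throws IOException {
        if (position < 0 || position >= count()) {
            throw new IndexOutOfBoundsException("No index entry " + position);
        }
        return position * ENTRY_BYTES;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

With this layout, moving to the next or previous record is just `position + 1` or `position - 1`, which covers the forward/backward navigation from the question without storing explicit links.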

As others have stated, a database would seem to be a better solution. The following Java SQL databases could be used: H2, Derby or HSQLDB.

If you want to use an index file, look at Berkeley DB or a NoSQL store.

If there is some reason for using a file, look at JRecord. It has:

  1. Several classes for reading/writing files with variable-length binary records (they were written for Cobol VB files). Any of the Mainframe / Fujitsu / Open Cobol VB file structures should do the job.
  2. An editor for editing JRecord files. The latest version of the editor can handle large files (it uses compression / a spill file). The editor suffers from having to download the whole file, and only one user can edit the file at a time.

The JRecord solution will only work if:

  • There is a limited number of users (preferably one), all in one location
  • The infrastructure is fast
