
How to manage files with indexes on a file system using Java

Original post: 2012-02-20 22:44:24. Tags: java, indexing, filesystems, nosql

I am planning to develop a server application to support and handle high-volume data migration.

Imagine this as a queue-based platform where the client programs (source agents that pull metadata from a content management system) send data packets (approximately 1 KB each) to the server, and the server stores these packets in its designated file system.

The server will categorize each data packet based on some of its header information, and should be able to retrieve and return the appropriate packet when it is queried using some of that header information.

We could easily do this with a standard DBMS if the metadata were properly defined, but in my case the packet header information will change over time and I don't want to redesign my database frequently.

The challenge I see here is to store the packet files in a file system efficiently (so that it won't affect file server performance) and also to maintain index information that can be used to locate the appropriate packets when requested.

I am thinking about using a non-DBMS open source framework (Java based, NoSQL?) that can serve the above purpose. The number of packets can range from a few hundred thousand to several million, depending on the volume of the source repository.

I would appreciate your input.

1 Answer

A column-oriented database such as Apache Cassandra could handle this scenario. The indexing provided in Cassandra is relatively basic, but would probably be OK for your use case. Several million 1 KB values would be a pretty small dataset for Cassandra and should be no problem at all.

Additional metadata columns could be written alongside the main data packets; the column names can be decided on the fly if desired, so this would allow your header format to evolve.
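To make the "columns decided on the fly" idea concrete, here is a rough plain-Java sketch of the data model. It is not Cassandra's API; `PacketStore` and its methods are names I made up for illustration, and everything lives in memory. Each packet carries whatever header columns it happens to have, and a basic index maps each (column, value) pair to the matching packet ids.

```java
import java.util.*;

/** Toy in-memory model of rows with free-form columns. Hypothetical; not Cassandra's API. */
class PacketStore {
    // packet id -> payload
    private final Map<String, String> payloadById = new HashMap<>();
    // packet id -> header columns; each packet may carry a different set of columns
    private final Map<String, Map<String, String>> headersById = new HashMap<>();
    // column name -> column value -> ids of packets having that value (a very basic index)
    private final Map<String, Map<String, Set<String>>> index = new HashMap<>();

    void put(String packetId, String payload, Map<String, String> headers) {
        // Un-index the headers of any earlier version of this packet.
        Map<String, String> previous = headersById.put(packetId, new HashMap<>(headers));
        if (previous != null) {
            for (Map.Entry<String, String> e : previous.entrySet()) {
                Map<String, Set<String>> byValue = index.get(e.getKey());
                Set<String> ids = byValue == null ? null : byValue.get(e.getValue());
                if (ids != null) {
                    ids.remove(packetId);
                }
            }
        }
        payloadById.put(packetId, payload);
        // Index every header column, whatever its name is; no schema is declared up front.
        for (Map.Entry<String, String> e : headers.entrySet()) {
            index.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                 .computeIfAbsent(e.getValue(), k -> new TreeSet<>())
                 .add(packetId);
        }
    }

    /** Returns a sorted copy of the ids of all packets whose header `column` equals `value`. */
    Set<String> find(String column, String value) {
        Map<String, Set<String>> byValue = index.get(column);
        Set<String> ids = byValue == null ? null : byValue.get(value);
        return ids == null ? new TreeSet<>() : new TreeSet<>(ids);
    }

    /** Returns the payload for the given id, or null if the id is unknown. */
    String payload(String packetId) {
        return payloadById.get(packetId);
    }
}
```

A packet written later with a new header such as `owner` becomes queryable by that header straight away, without touching older packets. That is the property you get from deciding column names at write time.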

The data in Cassandra is collected in in-memory tables before being written to disk efficiently as immutable "SSTables". It is also written immediately to a commit log to provide durability in case of crashes etc.
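If it helps to picture that write path, below is a heavily simplified sketch in plain Java. It is only an illustration of the general pattern (append to a log, buffer in a sorted in-memory table, flush to an immutable sorted file) and not how Cassandra is actually implemented. `TinyLsmStore`, the file names, and the tab-separated format are all assumptions of mine. It also assumes keys and values contain no tabs or newlines, starts from an empty directory, and leaves out commit log replay on startup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

/** Toy log + memtable + sorted-segment store. Hypothetical; not Cassandra's implementation. */
class TinyLsmStore {
    private final Path dir;
    private final Path commitLog;
    private final int flushThreshold;
    private final TreeMap<String, String> memtable = new TreeMap<>(); // kept sorted by key
    private int segmentCount = 0;

    TinyLsmStore(Path dir, int flushThreshold) throws IOException {
        this.dir = Files.createDirectories(dir);
        this.commitLog = dir.resolve("commit.log");
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) throws IOException {
        // 1. Append to the commit log first, so the write can be recovered after a crash.
        //    (A real system would also control when the log is synced to disk.)
        Files.write(commitLog, (key + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        // 2. Apply the write to the in-memory table.
        memtable.put(key, value);
        // 3. Once the table is large enough, write it out as an immutable sorted segment.
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    private void flush() throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : memtable.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        Path segment = dir.resolve("segment-" + segmentCount + ".txt");
        // CREATE_NEW fails if the file exists: segments are written once and never modified.
        Files.write(segment, lines, StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW);
        segmentCount++;
        memtable.clear();
        // The flushed entries are now safely in the segment, so the log can be discarded.
        Files.deleteIfExists(commitLog);
    }

    /** Looks in the memtable first, then in segments from newest to oldest. */
    String get(String key) throws IOException {
        String value = memtable.get(key);
        if (value != null) {
            return value;
        }
        for (int i = segmentCount - 1; i >= 0; i--) {
            Path segment = dir.resolve("segment-" + i + ".txt");
            for (String line : Files.readAllLines(segment, StandardCharsets.UTF_8)) {
                int tab = line.indexOf('\t');
                if (tab >= 0 && line.substring(0, tab).equals(key)) {
                    return line.substring(tab + 1);
                }
            }
        }
        return null;
    }

    int segmentCount() {
        return segmentCount;
    }
}
```

The point of the pattern is that every disk write is sequential: appends to the log and one-shot writes of whole sorted files. That is why a steady stream of small 1 KB packets is cheap to absorb. The linear scan in `get` is just for brevity; real SSTables carry their own index structures.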