C++ inserting a line into a file at a specific line number

I want to be able to read from an unsorted source text file (one record in each line), and insert the line/record into a destination text file by specifying the line number where it should be inserted.

Where to insert the line/record into the destination file will be determined by comparing the incoming line from the incoming file to the already ordered list in the destination file. (The destination file will start as an empty file and the data will be sorted and inserted into it one line at a time as the program iterates over the incoming file lines.)

Incoming File Example:

1 10/01/2008 line1data
2 11/01/2008 line2data
3 10/15/2008 line3data

Desired Destination File Example:

2 11/01/2008 line2data
3 10/15/2008 line3data
1 10/01/2008 line1data

I could do this by performing the sort in memory via a linked list or similar, but I want to allow this to scale to very large files. (And I am having fun trying to solve this problem as I am a C++ newbie :).)

One of the ways to do this may be to open two file streams with fstream (one in and one out, or just one in/out stream), but then I run into the difficulty that seeking in a file works on absolute byte positions from the start of the file rather than on line numbers :).

I'm sure problems like this have been tackled before, and I would appreciate advice on how to proceed in a way that follows good practice.

I'm using Visual Studio 2008 Pro C++, and I'm just learning C++.

If the file is just a plain text file, then I'm afraid the only way to find a particular numbered line is to walk the file counting lines as you go.

The usual 'non-memory' way of doing what you're trying to do is to copy the file from the original to a temporary file, inserting the data at the right point, and then do a rename/replace of the original file.

Obviously, once you've done your insertion, you can copy the rest of the file in one big lump, because you don't care about counting lines any more.
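A minimal sketch of that approach (the function name and error handling here are my own; it assumes 1-based line numbers and inserts before the given line):

    #include <cstdio>
    #include <fstream>
    #include <string>

    // Copy `path` to a temporary file, inserting `newLine` so that it becomes
    // line `lineNo` (1-based), then replace the original with the copy.
    bool insertLineAt(const std::string& path, int lineNo, const std::string& newLine)
    {
        std::ifstream in(path.c_str());
        std::string tmpPath = path + ".tmp";
        std::ofstream out(tmpPath.c_str());
        if (!in || !out)
            return false;

        std::string line;
        int current = 1;
        while (std::getline(in, line)) {
            if (current == lineNo)
                out << newLine << '\n';      // insertion point reached
            out << line << '\n';             // copy everything else through
            ++current;
        }
        if (current <= lineNo)               // insertion point at or past EOF
            out << newLine << '\n';

        in.close();
        out.close();
        std::remove(path.c_str());           // rename fails on Windows if the target exists
        return std::rename(tmpPath.c_str(), path.c_str()) == 0;
    }

Once the insertion point has been passed, a real implementation could switch from line-by-line copying to copying large raw blocks.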

A [distinctly-no-c++] solution would be to use the *nix sort tool, sorting on the second column of data. It might look something like this:

cat <file> | sort -k 2,2 > <file2> ; mv <file2> <file>

It's not exactly in-place, and it ignores the request to use C++, but it does work :)

You might hope to do it in one step:

cat <file> | sort -k 2,2 > <file>

but that won't work: the shell truncates <file> before sort ever reads it. sort's -o option exists for exactly this case and is safe to use with the input file:

sort -k 2,2 <file> -o <file>
* http://www.ss64.com/bash/sort.html - sort man page

One way to do this is not to keep the file sorted, but to use a separate index, using Berkeley DB. Each record in the db has the sort keys and the offset into the main file. The advantage to this is that you can have multiple ways of sorting without duplicating the text file. You can also change lines without rewriting the file, by appending the changed line at the end and updating the index to ignore the old line and point to the new one. We used this successfully for multi-GB text files that we had to make many small changes to.
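A minimal sketch of the idea, using a std::multimap as an in-memory stand-in for the Berkeley DB B-tree (a real version would persist the index through the DB API; the key extraction just assumes the question's "number date data" line format):

    #include <fstream>
    #include <map>
    #include <string>

    // Maps a sort key to the byte offset of a line in the main text file.
    // The text file itself is never reordered.
    typedef std::multimap<std::string, std::streampos> LineIndex;

    LineIndex buildIndex(const std::string& path)
    {
        LineIndex index;
        std::ifstream in(path.c_str());
        std::string line;
        std::streampos offset = in.tellg();
        while (std::getline(in, line)) {
            // Assume the sort key is the second whitespace-separated
            // field (the date column in the question's sample data).
            std::string::size_type a = line.find(' ');
            std::string::size_type b = line.find(' ', a + 1);
            index.insert(std::make_pair(line.substr(a + 1, b - a - 1), offset));
            offset = in.tellg();             // offset of the next line
        }
        return index;
    }

    // To "change" a line: append the new version at the end of the file and
    // repoint its index entry at the new offset; the old bytes become dead space.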

Edit: The code I developed to do this is part of a larger package that can be downloaded here. The specific code is in the btree* files under source/IO.

The basic problem is that under common OSs, files are just streams of bytes. There is no concept of lines at the filesystem level. Those semantics have to be added as an additional layer on top of the OS-provided facilities. Although I have never used it, I believe that VMS has a record-oriented filesystem that would make what you want to do easier. But under Linux or Windows, you can't insert into the middle of a file without rewriting the rest of the file. It is similar to memory: at the highest level, it's just a sequence of bytes, and if you want something more complex, like a linked list, it has to be added on top.
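A quick way to see this: writing at a seek position overwrites bytes in place rather than shifting the rest of the file along. A small demonstration (the file name is arbitrary):

    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
        // Create a file containing "aaaabbbb".
        { std::ofstream out("demo.txt"); out << "aaaabbbb"; }

        std::fstream f("demo.txt", std::ios::in | std::ios::out);
        f.seekp(4);                  // seek to the middle...
        f << "XX";                   // ...and write: this OVERWRITES
        f.seekg(0);                  // seek before switching to reading
        std::string s;
        f >> s;
        std::cout << s << '\n';      // prints "aaaaXXbb", not "aaaaXXbbbb"
        return 0;
    }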

I think the question is more about implementation than about specific algorithms; specifically, about handling very large datasets.

Suppose the source file has 2^32 lines of data. What would be an efficient way to sort the data?

Here's how I'd do it:

  1. Parse the source file and extract the following information: sort key, offset of the line in the file, length of the line. This information is written to another file. This produces a dataset of fixed-size elements that is easy to index; call it the index file. (This step is sketched in the code after this list.)

  2. Use a modified merge sort. Recursively divide the index file until the number of elements to sort reaches some minimum amount; a true merge sort recurses down to 1 or 0 elements, but I suggest stopping at 1024 or so (this will need fine-tuning). Load each block of data from the index file into memory, perform a quicksort on it, and then write the data back to disk.

  3. Perform the merge on the index file. This is tricky, but it can be done like this: load a block of data from each source (1024 entries, say). Merge into a temporary output file and write. When a block is emptied, refill it. When no more source data is found, read the temporary file from the start and overwrite the two parts being merged; they should be adjacent. Obviously, the final merge doesn't need to copy the data (or even create a temporary file). Thinking about this step, it is probably possible to set up a naming convention for the merged index files so that the data doesn't need to overwrite the unmerged data (if you see what I mean).

  4. Read the sorted index file, pull each line of data out of the source file, and write it to the result file.

It certainly won't be quick with all that file reading and writing, but it should be quite efficient; the real killer is the random seeking of the source file in the final step. Up to that point, the disk access is usually linear and should therefore be reasonably efficient.
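A sketch of step 1, building the index file of fixed-size records (the 32-byte key width and the field-extraction logic are arbitrary choices for illustration):

    #include <cstring>
    #include <fstream>
    #include <string>

    // Fixed-size record: record i lives at offset i * sizeof(IndexRecord),
    // so the index file can be block-loaded and sorted without parsing.
    struct IndexRecord {
        char key[32];             // sort key, zero-padded
        unsigned long offset;     // byte offset of the line in the source file
        unsigned long length;     // length of the line in bytes
    };

    void buildIndexFile(const char* srcPath, const char* idxPath)
    {
        std::ifstream src(srcPath);
        std::ofstream idx(idxPath, std::ios::binary);
        std::string line;
        unsigned long offset = 0;
        while (std::getline(src, line)) {
            IndexRecord rec;
            std::memset(&rec, 0, sizeof rec);
            // Assume the key is the second whitespace-separated field.
            std::string::size_type a = line.find(' ');
            std::string::size_type b = line.find(' ', a + 1);
            std::string key = line.substr(a + 1, b - a - 1);
            std::strncpy(rec.key, key.c_str(), sizeof rec.key - 1);
            rec.offset = offset;
            rec.length = static_cast<unsigned long>(line.size());
            idx.write(reinterpret_cast<const char*>(&rec), sizeof rec);
            offset += line.size() + 1;   // +1 for the newline
        }
    }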

Try a modified Bucket Sort. Assuming the id values lend themselves well to it, you'll get a much more efficient sorting algorithm. You may be able to enhance I/O efficiency by actually writing out the buckets (use small ones) as you scan, thus potentially reducing the amount of random file I/O you need. Or not.
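A minimal sketch of the scatter pass, under the assumption that the month of the date in the second field makes a reasonable bucket key (re-opening the bucket file on every line keeps the sketch short; real code would cache the open streams):

    #include <fstream>
    #include <string>

    // One pass over the input: append each line to a small bucket file
    // chosen from its key. Each bucket can then be sorted in memory and
    // the buckets concatenated in key order.
    void scatterIntoBuckets(const char* srcPath)
    {
        std::ifstream src(srcPath);
        std::string line;
        while (std::getline(src, line)) {
            std::string::size_type a = line.find(' ');
            std::string month = line.substr(a + 1, 2);   // e.g. "10", "11"
            std::string bucketPath = "bucket_" + month + ".txt";
            std::ofstream bucket(bucketPath.c_str(), std::ios::app);
            bucket << line << '\n';
        }
    }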

Hopefully, there are some good code examples of how to insert a record into the destination file based on line number.

You can't insert content into the middle of a file (i.e., without overwriting what was previously there); I'm not aware of any production-level filesystems that support it.
