
How can I quickly create large (>1gb) text+binary files with “natural” content? (C#)

For purposes of testing compression, I need to be able to create large files, ideally in text, binary, and mixed formats.

  • The content of the files should be neither completely random nor uniform.
    A binary file with all zeros is no good. A binary file with totally random data is also not good. For text, a file with totally random sequences of ASCII is not good - the text files should have patterns and frequencies that simulate natural language, or source code (XML, C#, etc). Pseudo-real text.
  • The size of each individual file is not critical, but for the set of files, I need the total to be ~8gb.
  • I'd like to keep the number of files at a manageable level, let's say O(10).

For creating binary files, I can new a large buffer and do System.Random.NextBytes followed by FileStream.Write in a loop, like this:

// size (total bytes to write), sz (the buffer size, e.g. 512k), zeroes, Filename
// and _rnd (a System.Random instance) are fields defined elsewhere in the class.
Int64 bytesRemaining = size;
byte[] buffer = new byte[sz];
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        // Write a full buffer, or just the remaining tail on the last pass.
        int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
        if (!zeroes) _rnd.NextBytes(buffer);   // refill with random bytes unless an all-zero file was requested
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
    fileStream.Close();   // redundant inside the using block, but harmless
}

With a large enough buffer, let's say 512k, this is relatively fast, even for files over 2 or 3gb. But the content is totally random, which is not what I want.

For text files, the approach I have taken is to use Lorem Ipsum, and repeatedly emit it via a StreamWriter into a text file. The content is non-random and non-uniform, but it does have many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (<1k), it takes many loops and a very, very long time.

Neither of these is quite satisfactory for me.

I have seen the answers to Quickly create large file on a windows system? Those approaches are very fast, but I think they just fill the file with zeroes, or random data, neither of which is what I want. I have no problem with running an external process like contig or fsutil, if necessary.

The tests run on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don't know of any that are sufficiently large.

What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its content many times? This would work with either a text or binary file.

Currently I have an approach that sort of works but it takes too long to run.

Has anyone else solved this?

Is there a much faster way to write a text file than via StreamWriter?

Suggestions?

EDIT: I like the idea of a Markov chain to produce more natural text. Still need to confront the issue of speed, though.

For text, you could use the stack overflow community dump; there is about 300 megs of data there. It will only take about 6 minutes to load into a db with the app I wrote, and probably about the same time to dump all the posts to text files. That would easily give you anywhere between 200K and 1 million text files, depending on your approach (with the added bonus of having source and xml mixed in).

You could also use something like the wikipedia dump; it seems to ship in MySQL format, which would make it super easy to work with.

If you are looking for a big file that you can split up, for binary purposes, you could either use a VM vmdk or a DVD ripped locally.

Edit

Mark mentions the project gutenberg download; this is also a really good source for text (and audio), which is available for download via bittorrent.

You could always code yourself a little web crawler...

UPDATE Calm down guys, this would be a good answer, if he hadn't said that he already had a solution that "takes too long".

A quick check here would appear to indicate that downloading 8GB of anything would take a relatively long time.

I think you might be looking for something like a Markov chain process to generate this data. It's both stochastic (randomised) and structured, in that it operates based on a finite state machine.

Indeed, Markov chains have been used for generating semi-realistic looking text in human languages. In general, they are not trivial things to analyse properly, but the fact that they exhibit certain properties should be good enough for you. (Again, see the Properties of Markov chains section of the page.) Hopefully you should see how to design one; to implement, it is actually quite a simple concept. Your best bet will probably be to create a framework for a generic Markov process and then analyse either natural language or source code (whichever you want your random data to emulate) in order to "train" your Markov process. In the end, this should give you very high quality data in terms of your requirements. Well worth the effort, if you need these enormous amounts of test data.
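
As an illustration, here is a minimal sketch of an order-1, word-level Markov chain in C#. The class and method names are made up for this example, and it assumes the training text is at least a few thousand words long:

using System;
using System.Collections.Generic;
using System.Text;

class MarkovTextGenerator
{
    private readonly Dictionary<string, List<string>> _transitions =
        new Dictionary<string, List<string>>();
    private readonly List<string> _startWords;
    private readonly Random _rnd = new Random();

    public MarkovTextGenerator(string trainingText)
    {
        // Record, for each word, every word that follows it in the training text.
        string[] words = trainingText.Split(
            new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length - 1; i++)
        {
            List<string> followers;
            if (!_transitions.TryGetValue(words[i], out followers))
                _transitions[words[i]] = followers = new List<string>();
            followers.Add(words[i + 1]);
        }
        _startWords = new List<string>(_transitions.Keys);
    }

    public string Generate(int wordCount)
    {
        var sb = new StringBuilder();
        string current = _startWords[_rnd.Next(_startWords.Count)];
        for (int i = 0; i < wordCount; i++)
        {
            sb.Append(current).Append(' ');
            List<string> followers;
            if (_transitions.TryGetValue(current, out followers))
                current = followers[_rnd.Next(followers.Count)];   // successors keep their source frequency
            else
                current = _startWords[_rnd.Next(_startWords.Count)];   // dead end: restart at a random word
        }
        return sb.ToString();
    }
}

Training it on a few megabytes of natural text and calling Generate repeatedly produces output with realistic word frequencies, which compresses much more like real prose than uniform random characters do.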

I think the Windows directory will probably be a good enough source for your needs. If you're after text, I would recurse through each of the directories looking for .txt files and loop through them, copying them to your output file as many times as needed to get the right size file.

You could then use a similar approach for binary files by looking for .exes or .dlls.
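
A rough sketch of that approach, under the assumption that the directory contains at least one non-empty matching file (the class name, paths and sizes here are placeholders):

using System;
using System.IO;

class DirectoryConcatenator
{
    // Cycle through every matching file under sourceDir, appending each one to
    // the output until the target size is reached.
    public static void Build(string sourceDir, string pattern, string outputFile, long targetBytes)
    {
        string[] sources = Directory.GetFiles(sourceDir, pattern, SearchOption.AllDirectories);
        using (var output = new FileStream(outputFile, FileMode.Create, FileAccess.Write))
        {
            long written = 0;
            int i = 0;
            while (written < targetBytes && sources.Length > 0)
            {
                byte[] data = File.ReadAllBytes(sources[i++ % sources.Length]);
                int count = (int)Math.Min(data.Length, targetBytes - written);
                output.Write(data, 0, count);
                written += count;
            }
        }
    }
}

For example, Build(@"c:\windows", "*.txt", @"c:\temp\bigtext.dat", 2L * 1024 * 1024 * 1024) would build a roughly 2gb text sample, and "*.dll" would give the binary equivalent (in practice you may also need to handle access-denied subdirectories).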

For text files you might have some success taking an english word list and simply pulling words from it at random. This won't produce real english text but I would guess it would produce a letter frequency similar to what you might find in english.

For a more structured approach you could use a Markov chain trained on some large free english text.
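
A minimal sketch of the word-list idea, assuming wordListPath points at a plain one-word-per-line dictionary file (all names and paths here are placeholders):

using System;
using System.IO;

class WordSalad
{
    public static void Write(string wordListPath, string outputFile, long targetBytes)
    {
        string[] words = File.ReadAllLines(wordListPath);
        var rnd = new Random();
        using (var writer = new StreamWriter(outputFile))
        {
            long written = 0;
            while (written < targetBytes)
            {
                // Pick a random word and follow it with a space.
                string word = words[rnd.Next(words.Length)];
                writer.Write(word);
                writer.Write(' ');
                written += word.Length + 1;   // rough byte count, assuming mostly ASCII words
            }
        }
    }
}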

Why don't you just take Lorem Ipsum and create a long string in memory before your output? The text needs only O(log n) append operations to reach the target length if you double the amount of text you have each time. You can even calculate the total length of the data beforehand, allowing you to avoid having to copy contents to a new string/array.

Since your buffer is only 512k or whatever you set it to be, you only need to generate that much data before writing it, since that is only the amount you can push to the file at one time. You are going to be writing the same text over and over again, so just use the original 512k that you created the first time.
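
A small sketch of the doubling trick, assuming a non-empty seed string and a target length that fits comfortably in memory (the names are illustrative):

using System;
using System.Text;

class LoremExpander
{
    // Double the seed text in memory until it reaches targetChars characters,
    // so only O(log n) append operations are needed.
    public static string Expand(string seed, int targetChars)
    {
        var sb = new StringBuilder(seed, targetChars);
        while (sb.Length < targetChars)
            sb.Append(sb.ToString());   // each pass doubles the length
        return sb.ToString(0, targetChars);
    }
}

Growing from a 1k block to ~500mb this way takes roughly 19 doublings rather than half a million small writes.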

Wikipedia is excellent for compression testing for mixed text and binary. If you need benchmark comparisons, the Hutter Prize site can provide a high water mark for the first 100mb of Wikipedia. The current record is a 6.26 ratio, 16 mb.

Thanks for all the quick input. I decided to consider the problems of speed and "naturalness" separately. For the generation of natural-ish text, I have combined a couple of ideas.

  • To generate text, I start with a few text files from the project gutenberg catalog, as suggested by Mark Rushakoff.
  • I randomly select and download one document from that subset.
  • I then apply a Markov Process, as suggested by Noldorin, using that downloaded text as input.
  • I wrote a new Markov Chain in C# using Pike's economical Perl implementation as an example. It generates text one word at a time.
  • For efficiency, rather than use the pure Markov Chain to generate 1gb of text one word at a time, the code generates a random text of ~1mb and then repeatedly takes random segments of that and globs them together, as sketched after this list.
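
Here is a sketch of that globbing step, assuming seedText already holds the ~1mb of Markov-generated text (the class and variable names are illustrative):

using System;
using System.IO;

class SegmentGlobber
{
    private static readonly Random _rnd = new Random();

    public static void WriteGlobbedFile(string seedText, string path, long targetBytes)
    {
        using (var writer = new StreamWriter(path))
        {
            long written = 0;
            while (written < targetBytes)
            {
                // Take a random 4k-64k character slice of the seed and append it.
                int start = _rnd.Next(seedText.Length / 2);
                int length = Math.Min(_rnd.Next(4 * 1024, 64 * 1024), seedText.Length - start);
                writer.Write(seedText.Substring(start, length));
                written += length;   // approximate: one char per byte for mostly-ASCII text
            }
        }
    }
}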

UPDATE: As for the second problem, the speed - I took the approach of eliminating as much IO as possible; this is being done on my poor laptop with a 5400rpm mini-spindle. Which led me to redefine the problem entirely - rather than generating a FILE with random content, what I really want is the random content itself. Using a Stream wrapped around a Markov Chain, I can generate text in memory and stream it to the compressor, eliminating 8gb of write and 8gb of read. For this particular test I don't need to verify the compression/decompression round trip, so I don't need to retain the original content. So the streaming approach worked well to speed things up massively. It cut 80% of the time required.
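
A rough sketch of that streaming idea: a read-only Stream that pulls text from a generator delegate on demand, so a compressor can consume gigabytes of pseudo-text without anything touching disk. The Func<string> delegate stands in for the Markov chain generator described above:

using System;
using System.IO;
using System.Text;

class GeneratedTextStream : Stream
{
    private readonly Func<string> _nextChunk;   // e.g. () => markov.Generate(1000)
    private byte[] _pending = new byte[0];
    private int _offset;

    public GeneratedTextStream(Func<string> nextChunk)
    {
        _nextChunk = nextChunk;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Refill the pending buffer from the generator whenever it runs dry.
        if (_offset >= _pending.Length)
        {
            _pending = Encoding.UTF8.GetBytes(_nextChunk());
            _offset = 0;
        }
        int n = Math.Min(count, _pending.Length - _offset);
        Array.Copy(_pending, _offset, buffer, offset, n);
        _offset += n;
        return n;
    }

    // As long as the generator keeps returning text, the stream never ends;
    // the caller decides how many bytes to pull.
    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

A simple read loop can then pump, say, 8gb of bytes from this stream straight into whatever compression stream is under test.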

I haven't yet figured out how to do the binary generation, but it will likely be something analogous.

Thank you all, again, for all the helpful ideas.
