
Sorting a huge volume of data using a serialized binary search tree

I have 50 GB of structured key/value data like the following, stored in a text file (input.txt); the keys and values are 63-bit unsigned integers:

3633223656935182015 2473242774832902432
8472954724347873710 8197031537762113360
2436941118228099529 7438724021973510085
3370171830426105971 6928935600176631582
3370171830426105971 5928936601176631564
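
As an aside, reading one such line in C is straightforward; here is a minimal sketch (read_pair is a hypothetical helper name), assuming the two fields are separated by whitespace as shown:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Read one "key value" line from fp; returns 1 on success, 0 on
   EOF or malformed input.  SCNu64 parses an unsigned 64-bit
   integer, which covers the 63-bit range used here. */
static int read_pair(FILE *fp, uint64_t *key, uint64_t *value)
{
    return fscanf(fp, "%" SCNu64 " %" SCNu64, key, value) == 2;
}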

I need to sort this data by key in increasing order, keeping only the minimum value for each key. The result must be written to another text file (data.out) within 30 minutes. For example, for the sample above the result must be:

2436941118228099529 7438724021973510085
3370171830426105971 5928936601176631564
3633223656935182015 2473242774832902432
8472954724347873710 8197031537762113360

I decided that:

  • I will build a BST with the keys from input.txt and their minimum values, but this tree would be more than 50 GB; I have both time and memory limits at this point.

  • So I will use another text file (tree.txt) and serialize the BST into that file.

  • After that, I will traverse the tree with an in-order traversal and write the ordered data into the data.out file (a sketch of these steps follows below).
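
Here is a minimal sketch of that idea, assuming a binary tree file of fixed-size node records (named tree.bin here, a stand-in for the question's tree.txt) so that every node can be addressed by its byte offset; insert() works directly on the file and keeps the minimum value when a key repeats:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* One node of the serialized BST: a fixed-size record addressed by
   its byte offset in the tree file.  -1 marks a missing child; the
   root always lives at offset 0.  long is 64-bit on x86-64 Linux
   (the stated OS); use fseeko/ftello for portability elsewhere. */
typedef struct {
    uint64_t key;
    uint64_t value;
    long     left;   /* file offset of left child, -1 if none  */
    long     right;  /* file offset of right child, -1 if none */
} Node;

/* Append a fresh leaf at the end of the file, return its offset. */
static long append_node(FILE *f, uint64_t key, uint64_t value)
{
    Node n = { key, value, -1, -1 };
    fseek(f, 0, SEEK_END);
    long off = ftell(f);
    fwrite(&n, sizeof n, 1, f);
    return off;
}

/* Insert (key, value), keeping the minimum value on duplicate keys. */
static void insert(FILE *f, uint64_t key, uint64_t value)
{
    fseek(f, 0, SEEK_END);
    if (ftell(f) == 0) {                /* empty tree: create root */
        append_node(f, key, value);
        return;
    }
    long off = 0;                       /* start at the root */
    for (;;) {
        Node n;
        fseek(f, off, SEEK_SET);
        fread(&n, sizeof n, 1, f);
        if (key == n.key) {             /* duplicate: keep the minimum */
            if (value < n.value) {
                n.value = value;
                fseek(f, off, SEEK_SET);
                fwrite(&n, sizeof n, 1, f);
            }
            return;
        }
        long *child = (key < n.key) ? &n.left : &n.right;
        if (*child == -1) {             /* attach a new leaf here */
            *child = append_node(f, key, value);
            fseek(f, off, SEEK_SET);
            fwrite(&n, sizeof n, 1, f); /* rewrite parent with new link */
            return;
        }
        off = *child;
    }
}

/* In-order traversal emits the keys in increasing order. */
static void inorder(FILE *f, long off, FILE *out)
{
    if (off == -1)
        return;
    Node n;
    fseek(f, off, SEEK_SET);
    fread(&n, sizeof n, 1, f);
    inorder(f, n.left, out);
    fprintf(out, "%" PRIu64 " %" PRIu64 "\n", n.key, n.value);
    inorder(f, n.right, out);
}

The tree file would be opened once with fopen("tree.bin", "w+b"), each input pair fed to insert(), and finally inorder(tree, 0, out) would write data.out (for a non-empty tree). Be warned that a plain BST degenerates to O(n) per insert on skewed input, and the recursive traversal can overflow the stack at this scale; this is why on-disk trees are normally B-trees with node caching.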

My problem is mostly with the serialization and deserialization part. How can I serialize this kind of data? I also want to run the INSERT operation directly on the serialized data, because my data is bigger than memory and I can't do this in RAM. In effect, I want to use text files as memory.

By the way, I am very new to this kind of thing. If there is a flaw in my algorithm steps, please warn me. Any thoughts, techniques, and code samples would be helpful.

OS: Linux
Language: C
RAM: 6 GB

Note: I am not allowed to use built-in functions like sort and merge.

Considering that your file seems to have a uniform line size of around 40 characters, giving roughly 1,250,000,000 lines in total, I'd split the input file into smaller ones with the command:

split -l 2500000 biginput.txt

then I'd sort each of them:

for f in x*; do sort -n $f > s$f; done

and finally I'd merge them with:

sort -m sx* > bigoutput.txt
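
Since the question's note rules out the built-in sort and merge, the final step can also be hand-written. Below is a k-way merge sketch in C, assuming the chunk files are already sorted by key and are passed on the command line (merge.c is a hypothetical name); it additionally keeps only the minimum value per key, as the problem requires:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>

typedef struct {
    FILE    *fp;
    uint64_t key, value;   /* current head pair of this stream */
    int      live;         /* 1 while this stream still has data */
} Stream;

/* Pull the next "key value" pair from one sorted chunk file. */
static void advance(Stream *s)
{
    s->live = fscanf(s->fp, "%" SCNu64 " %" SCNu64, &s->key, &s->value) == 2;
}

int main(int argc, char **argv)   /* usage: ./merge out.txt chunk1 chunk2 ... */
{
    if (argc < 3)
        return 1;
    FILE *out = fopen(argv[1], "w");
    int k = argc - 2;
    Stream *s = malloc(k * sizeof *s);   /* error checks omitted for brevity */
    for (int i = 0; i < k; i++) {
        s[i].fp = fopen(argv[i + 2], "r");
        advance(&s[i]);
    }
    for (;;) {
        int min = -1;                    /* linear min-scan over stream heads */
        for (int i = 0; i < k; i++)
            if (s[i].live && (min < 0 || s[i].key < s[min].key))
                min = i;
        if (min < 0)
            break;                       /* every stream is exhausted */
        uint64_t key = s[min].key, val = s[min].value;
        /* Consume every pair carrying this key, keeping the minimum value. */
        for (int i = 0; i < k; i++)
            while (s[i].live && s[i].key == key) {
                if (s[i].value < val)
                    val = s[i].value;
                advance(&s[i]);
            }
        fprintf(out, "%" PRIu64 " %" PRIu64 "\n", key, val);
    }
    fclose(out);
    return 0;
}

Compile and run it over the sorted chunks, e.g. cc -O2 merge.c -o merge && ./merge bigoutput.txt sx*. The inner min-scan is O(k) per output line; with the ~500 chunks the split above produces, a binary min-heap over the stream heads would reduce that to O(log k).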
