Is there any quick way to do the following in sql or python?

I have a 1 TB dataset with 3 columns and about 20 billion rows. I would like to split this data, in some random order, into two sub-datasets of roughly 80/20. However, the two sub-datasets must be non-overlapping, meaning no entry in one chunk should appear in the other: a value in any column of one chunk should not appear in any column of the other chunk. As an example, suppose the data is:

fruit apple seeds
vegetable carrot yellow
crops fruit lettuce
green onion vegetable
lettuce red health

The two sub-datasets could be:

fruit apple seeds
crops fruit lettuce
lettuce red health

and

vegetable carrot yellow
green onion vegetable

Is there any efficient way to do this for such large data?

You can just iterate over the file and randomly assign rows to sub-data-1 and sub-data-2 according to the proportions you've laid out.

import random

# Stream the file once; each line goes to s1 with probability 0.8,
# otherwise to s2, giving an approximate 80/20 split.
with open('large_file', 'r') as lf, \
        open('s1', 'w') as s1, open('s2', 'w') as s2:
    for line in lf:
        if random.random() < 0.8:
            s1.write(line)
        else:
            s2.write(line)
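
Note that a purely random per-row split does not enforce the stricter requirement in the question, that a value appearing in one chunk must never appear in any column of the other chunk. One possible way to handle that (a sketch under assumptions, not part of the answer above) is to group rows that share any value with a union-find and then assign whole groups at random. The UnionFind helper and the file names 'large_file', 's1', 's2' are illustrative; at 20 billion rows the in-memory dictionary would not fit, so a disk-backed or distributed equivalent would be needed, but the idea is the same:

import random

class UnionFind:
    # Minimal union-find over arbitrary hashable values (illustrative helper).
    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Unseen values become their own root; path halving keeps trees flat.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

uf = UnionFind()

# Pass 1: connect every value on a row, so rows that share any value
# end up in the same component.
with open('large_file', 'r') as lf:
    for line in lf:
        values = line.split()
        for v in values[1:]:
            uf.union(values[0], v)

# Pass 2: assign each component to s1 with probability 0.8 and route
# whole rows by the component of their first value.
assignment = {}
with open('large_file', 'r') as lf, \
        open('s1', 'w') as s1, open('s2', 'w') as s2:
    for line in lf:
        root = uf.find(line.split()[0])
        if root not in assignment:
            assignment[root] = s1 if random.random() < 0.8 else s2
        assignment[root].write(line)

Because whole components are assigned together, the actual row split can drift from 80/20 if a few components contain most of the rows; that is a property of the data rather than of the method.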
