
How to randomly delete a number of lines from a big file?

I have a big text file of 13 GB with 158,609,739 lines and I want to randomly select 155,000,000 of them.

I have tried to shuffle the file and then take the first 155,000,000 lines, but it seems that my RAM (16 GB) isn't big enough to do this. The pipelines I have tried are:

shuf file | head -n 155000000
sort -R file | head -n 155000000

Now, instead of selecting lines, I think it is more memory-efficient to delete 3,609,739 random lines from the file to get a final file of 155,000,000 lines.

As you copy each line of the file to the output, assess the probability that it should be deleted. The first line should have a 3,609,739/158,609,739 chance of being deleted. If you generate a random number between 0 and 1 and that number is less than that ratio, don't copy the line to the output. Now the odds for the second line are 3,609,738/158,609,738 (if the first line was deleted); if that line is not deleted, the odds for the third line are 3,609,738/158,609,737. Repeat until done.

Because the odds change with each line processed, this algorithm guarantees the exact line count. Once you've deleted 3,609,739 lines the odds go to zero; if at any point you would need to delete every remaining line in the file, the odds go to one.
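
A minimal Python 3 sketch of this idea (the file names are placeholders; the counts come from the question):

import random

to_delete = 3609739            # lines still to be deleted
remaining = 158609739          # lines not yet examined

# 'big.txt' and 'kept.txt' are placeholder file names
with open('big.txt') as src, open('kept.txt', 'w') as dst:
    for line in src:
        # delete this line with probability to_delete / remaining
        if random.random() < to_delete / remaining:
            to_delete -= 1
        else:
            dst.write(line)
        remaining -= 1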

You could always pre-generate the line numbers you plan on deleting (a list of 3,609,739 random numbers selected without replacement), then just iterate through the file and copy it to another, skipping lines as necessary. As long as you have space for a new file, this would work.

You could select the random numbers with random.sample. E.g.,

random.sample(xrange(158609739), 3609739)
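
For example, a short Python 3 sketch of that approach (file names are placeholders; converting the sample to a set keeps each per-line membership test O(1)):

import random

# line numbers to delete, chosen without replacement
skip = set(random.sample(range(158609739), 3609739))

# 'big.txt' and 'kept.txt' are placeholder file names
with open('big.txt') as src, open('kept.txt', 'w') as dst:
    for i, line in enumerate(src):
        if i not in skip:
            dst.write(line)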

Proof of Mark Ransom's Answer

Let's use numbers easier to think about (at least for me!):

  • 10 items
  • delete 3 of them

The first time through the loop we will assume that the first three items get deleted -- here's what the probabilities look like:

  • first item: 3 / 10 = 30%
  • second item: 2 / 9 = 22%
  • third item: 1 / 8 = 12%
  • fourth item: 0 / 7 = 0%
  • fifth item: 0 / 6 = 0%
  • sixth item: 0 / 5 = 0%
  • seventh item: 0 / 4 = 0%
  • eighth item: 0 / 3 = 0%
  • ninth item: 0 / 2 = 0%
  • tenth item: 0 / 1 = 0%

As you can see, once it hits zero, it stays at zero. But what if nothing is getting deleted?

  • first item: 3 / 10 = 30%
  • second item: 3 / 9 = 33%
  • third item: 3 / 8 = 38%
  • fourth item: 3 / 7 = 43%
  • fifth item: 3 / 6 = 50%
  • sixth item: 3 / 5 = 60%
  • seventh item: 3 / 4 = 75%
  • eighth item: 3 / 3 = 100%
  • ninth item: 2 / 2 = 100%
  • tenth item: 1 / 1 = 100%

So even though the probability varies per line, overall you get the results you are looking for. I went a step further and coded a test in Python for one million iterations as a final proof to myself -- remove seven items from a list of 100:

# python 3.2
from __future__ import division
from stats import mean  # http://pypi.python.org/pypi/stats
import random

counts = dict()
for i in range(100):
    counts[i] = 0

removed_failed = 0

for _ in range(1000000):
    to_remove = 7
    from_list = list(range(100))
    removed = 0
    while from_list:
        current = from_list.pop()
        probability = to_remove / (len(from_list) + 1)
        if random.random() < probability:
            removed += 1
            to_remove -= 1
            counts[current] += 1
    if removed != 7:
        removed_failed += 1

print(counts[0], counts[1], counts[2], '...',
      counts[49], counts[50], counts[51], '...',
      counts[97], counts[98], counts[99])
print("remove failed: ", removed_failed)
print("min: ", min(counts.values()))
print("max: ", max(counts.values()))
print("mean: ", mean(counts.values()))

and here are the results from one of the several times I ran it (they were all similar):

70125 69667 70081 ... 70038 70085 70121 ... 70047 70040 70170
remove failed:  0
min:  69332
max:  70599
mean:  70000.0

A final note: Python's random.random() returns values in [0.0, 1.0) (it never returns exactly 1.0).

I believe you're looking for "Algorithm S" from section 3.4.2 of Knuth (D. E. Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms, second edition. Addison-Wesley, 1981).

You can see several implementations at http://rosettacode.org/wiki/Knuth%27s_algorithm_S

The Perlmonks list has some Perl implementations of Algorithm S and Algorithm R that might also prove useful.

These algorithms rely on there being a meaningful interpretation of floating-point numbers like 3609739/158609739, 3609738/158609738, etc., which might not have sufficient resolution with a standard Float datatype unless that datatype is implemented using double precision or larger.
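
As a quick sanity check, Python's float is an IEEE 754 double (roughly 15-16 significant digits), which appears to be more than enough resolution for the ratios in this question:

# Consecutive ratios from the question, computed as doubles.
a = 3609739 / 158609739
b = 3609738 / 158609738
print(a, b, a != b)   # the ratios differ around the 7th significant digit,
                      # well within a double's ~15-16 significant digits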

Here's a possible solution using Python:

import random

# line numbers to drop; a set makes the per-line membership test O(1)
skipping = set(random.sample(range(158609739), 3609739))

input = open('input.txt')           # placeholder file names
output = open('output.txt', 'w')

for i, line in enumerate(input):
    if i in skipping:
        continue
    output.write(line)

input.close()
output.close()

Here's another using Mark's method:

import random

lines_in_file = 158609739
lines_left_in_file = lines_in_file
lines_to_delete = lines_in_file - 155000000

input = open('input.txt')           # placeholder file names
output = open('output.txt', 'w')

try:
    for line in input:
        current_probability = lines_to_delete / lines_left_in_file
        lines_left_in_file -= 1
        if random.random() < current_probability:
            lines_to_delete -= 1
            continue
        output.write(line)
except ZeroDivisionError:
    print("More than %d lines in the file" % lines_in_file)
finally:
    input.close()
    output.close()

I wrote this code before seeing that Darren Yin had expressed its principle.

I've modified my code to use the name skipping (I didn't dare to choose kangaroo ...) and the keyword continue, following Ethan Furman, whose code is based on the same principle.

I defined default arguments for the function's parameters so that the function can be used several times without having to re-assign them at each call.

import random
import os.path

def spurt(ff,skipping):
    for i,line in enumerate(ff):
        if i in skipping:
            print 'line %d excluded : %r' % (i,line)
            continue
        yield line

def randomly_reduce_file(filepath,nk = None,
                         d = {0:'st',1:'nd',2:'rd',3:'th'},spurt = spurt,
                         sample = random.sample,splitext = os.path.splitext):

    # count of the lines of the original file
    with open(filepath) as f:  nl = sum(1 for _ in f)

    # asking for the number of lines to keep, if not given as argument
    if nk is None:
        nk = int(raw_input('  The file has %d lines.'
                           '  How many of them do you '
                           'want to randomly keep ? : ' % nl))

    # transfer of the lines to keep,
    # from one file to another file with different name
    if nk<nl:
        with open(filepath,'rb') as f,\
             open('COPY'.join(splitext(filepath)),'wb') as g:
            g.writelines( spurt(f, set(sample(xrange(0,nl), nl-nk))) )
            # sample(xrange(0,nl), nl-nk) is the list of the line numbers
            # to be excluded; a set makes the membership test in spurt O(1)
    else:
        print '   %d is %s than the number of lines (%d) in the file\n'\
              '   no operation has been performed'\
              % (nk,'the same' if nk==nl else 'greater',nl)

With the $RANDOM variable you can get a random number between 0 and 32,767.

With this, you could read in each line and see if $RANDOM is less than 155,000,000 / 158,609,739 * 32,767 (which is 32,021), and if so, let the line through.

Of course, this wouldn't give you exactly 155,000,000 lines, but it would come pretty close, depending on the uniformity of the random number generator.

EDIT: Here is some code to get you started:

#!/bin/bash
# $RANDOM is uniform over 0..32767, so keep each line with probability ~32021/32768
while IFS= read -r line; do
  if (( $RANDOM < 32021 ))
  then
    echo "$line"
  fi
done

Call it like so:

thatScript.sh <inFile.txt >outFile.txt
