
Randomly selecting lines from files

I have a bunch of files, and every file has a header of 5 lines. In the rest of the file, each pair of lines forms an entry. I need to randomly select entries from these files. How can I select random files and random entries (pairs of lines, excluding the header)?

You might find perlfaq5 useful.

If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.

Here's an intuitive explanation of the algorithm.

Process the file line by line.
pick = line, with probability 1/N, where N = line number

In other words, on line 1 we pick line 1 with probability 1/1. On line 2, we change the pick to line 2 with probability 1/2. On line 3, we change the pick to line 3 with probability 1/3. And so on.
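The rule above can be sketched in a few lines of Python 3. This is only an illustration: the function name pick_random_line and the optional rng parameter are not part of the original answer.

```python
import random

def pick_random_line(lines, rng=random):
    """Return one uniformly chosen item from an iterable, reading it only once."""
    pick = None
    for n, line in enumerate(lines, start=1):
        # randrange(n) is 0 with probability 1/n, so line n replaces
        # the current pick with exactly that probability.
        if rng.randrange(n) == 0:
            pick = line
    return pick

if __name__ == "__main__":
    print(pick_random_line(["first", "second", "third"]))
```

Because only the current pick is kept, memory use does not grow with the size of the input.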

For an intuitive proof, imagine a file with 3 lines:

        1            Pick line 1.
       / \
     .5  .5
     /     \
    2       1        Switch to line 2?
   / \     / \
 .67 .33 .33 .67
 /     \ /     \
2       3       1    Switch to line 3?

The probability for each outcome:

Line 1: .5 * .67     = 1/3
Line 2: .5 * .67     = 1/3
Line 3: .5 * .33 * 2 = 1/3

From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) has an equal chance of being our current selection. When we advance to line 4, it has a 1/4 chance of being picked, exactly what it should be, which reduces the probabilities of the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4).
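The induction can also be checked with exact arithmetic. The Python 3 sketch below (the helper name final_pick_probabilities is made up for this illustration) multiplies the chance that line k is picked when it is read by the chance that no later line replaces it.

```python
from fractions import Fraction

def final_pick_probabilities(total_lines):
    """Exact probability that each line is the final pick."""
    probs = []
    for k in range(1, total_lines + 1):
        p = Fraction(1, k)  # line k is picked when it is read
        for j in range(k + 1, total_lines + 1):
            p *= 1 - Fraction(1, j)  # line j does not replace it
        probs.append(p)
    return probs

if __name__ == "__main__":
    for n in (3, 4, 10):
        print(n, final_pick_probabilities(n))
```

For every file length tried, each line ends up with probability 1/N, matching the argument above.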

Here's the Perl FAQ answer, adapted to your problem.

use strict;
use warnings;

# Ignore 5 lines.
<> for 1 .. 5;

# Use reservoir sampling to select pairs from remaining lines.
my (@picks, $n);
until (eof){
    my @lines;
    $lines[$_] = <> for 0 .. 1;

    $n ++;
    @picks = @lines if rand($n) < 1;
}

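print @picks;

A shell alternative (sort -R shuffles lines individually, so the two lines printed will generally not form an entry):

sed "1,5d" < FILENAME | sort -R | head -2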

Python solution: reads the file only once and requires little memory.

Invoke it like so: getRandomItems(file('myHuge.log'), 5, 2). It returns a list of 2 lines.

from random import randrange

def getRandomItems(f, skipFirst=0, numItems=1):
    for _ in xrange(skipFirst):
        f.next()
    n = 0; r = []
    while True:
        try:
            nxt = [f.next() for _ in range(numItems)]
        except StopIteration: break
        n += 1
        if not randrange(n):
            r = nxt
    return r

Returns an empty list if f runs out before yielding one complete group of numItems items. The code's only requirement is that the argument f is an iterator (supports the next() method). Hence we can pass something other than a file. Say we want to see the distribution:

>>> s={}
>>> for i in xrange(5000):
...     r = getRandomItems(iter(xrange(50)))[0]
...     s[r] = 1 + s.get(r,0)
... 
>>> for i in s: 
...     print i, '*' * s[i]
... 
0 ***********************************************************************************************
1 **************************************************************************************************************
2 ******************************************************************************************************
3 ***************************************************************************
4 *************************************************************************************************************************
5 ********************************************************************************
6 **********************************************************************************************
7 ***************************************************************************************
8 ********************************************************************************************
9 ********************************************************************************************
10 ***********************************************************************************************
11 ************************************************************************************************
12 *******************************************************************************************************************
13 *************************************************************************************************************
14 ***************************************************************************************************************
15 *****************************************************************************************************
16 ********************************************************************************************************
17 ****************************************************************************************************
18 ************************************************************************************************
19 **********************************************************************************
20 ******************************************************************************************
21 ********************************************************************************************************
22 ******************************************************************************************************
23 **********************************************************************************************************
24 *******************************************************************************************************
25 ******************************************************************************************
26 ***************************************************************************************************************
27 ***********************************************************************************************************
28 *****************************************************************************************************
29 ****************************************************************************************************************
30 ********************************************************************************************************
31 ********************************************************************************************
32 ****************************************************************************************************
33 **********************************************************************************************
34 ****************************************************************************************************
35 **************************************************************************************************
36 *********************************************************************************************
37 ***************************************************************************************
38 *******************************************************************************************************
39 **********************************************************************************************************
40 ******************************************************************************************************
41 ********************************************************************************************************
42 ************************************************************************************
43 ****************************************************************************************************************************
44 ****************************************************************************************************************************
45 ***********************************************************************************************
46 *****************************************************************************************************
47 ***************************************************************************************
48 ***********************************************************************************************************
49 ****************************************************************************************************************
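For readers on Python 3, where xrange and the next() method no longer exist, a rough equivalent might look like the sketch below. The name get_random_entry, its parameters and the sample data are assumptions made for this illustration, not part of the original answer.

```python
import io
import random
from itertools import islice

def get_random_entry(f, skip_first=0, entry_size=1, rng=random):
    """Reservoir-sample one entry (entry_size consecutive items) from f."""
    it = iter(f)
    # Consume the header lines without storing them.
    for _ in islice(it, skip_first):
        pass
    pick, n = [], 0
    while True:
        entry = list(islice(it, entry_size))
        if len(entry) < entry_size:
            break  # end of input, or a trailing incomplete entry
        n += 1
        if rng.randrange(n) == 0:  # true with probability 1/n
            pick = entry
    return pick

if __name__ == "__main__":
    sample = io.StringIO("h1\nh2\nh3\nh4\nh5\na1\na2\nb1\nb2\n")
    print(get_random_entry(sample, skip_first=5, entry_size=2))
```

As in the original, an empty list comes back when no complete entry follows the header.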

This answer is in Python. It assumes you can read a whole file into memory.

#using python 2.6
import sys
import os
import itertools
import random

def main(directory, num_files=5, num_entries=5):
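    # os.listdir returns bare file names, so join them with the directory
    file_paths = [os.path.join(directory, name) for name in os.listdir(directory)]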

    # get a random sampling of the available paths
    chosen_paths = random.sample(file_paths, num_files)

    for path in chosen_paths:
        chosen_entries = get_random_entries(path, num_entries)
        for entry in chosen_entries:
            # do something with your chosen entries
            print entry

def get_random_entries(file_path, num_entries):
    with open(file_path, 'r') as file:
        # read the lines and slice off the headers
        lines = file.readlines()[5:]

        # group the lines into pairs (i.e. entries)
        entries = list(itertools.izip_longest(*[iter(lines)]*2))

        # return a random sampling of entries
        return random.sample(entries, num_entries)

if __name__ == '__main__':
    #use optparse here to do fancy things with the command line args
    # pass the directory itself, not the list of arguments
    main(sys.argv[1])

Links: itertools, random, optparse

Two other ways to do this:

1- by generators (may still require a lot of memory): http://www.usrsb.in/Picking-Random-Items--Take-Two--Hacking-Python-s-Generators-.html

2- by clever seeking (best method actually): http://www.regexprn.com/2008/11/read-random-line-in-large-file-in.html

I copy here the code of the clever Jonathan Kupferman:

#!/usr/bin/python

import os,random

filename="averylargefile"
file = open(filename,'r')

#Get the total file size
file_size = os.stat(filename)[6]

while 1:
      #Seek to a place in the file which is a random distance away
      #Mod by file size so that it wraps around to the beginning
      file.seek((file.tell()+random.randint(0,file_size-1))%file_size)

      #dont use the first readline since it may fall in the middle of a line
      file.readline()
      #this will return the next (complete) line from the file
      line = file.readline()

      #here is your random line in the file
      print line
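One caveat may be worth checking before relying on the seek approach: a line is returned whenever the random offset lands in the line before it, so lines that follow long lines should come up more often. The Python 3 sketch below enumerates every offset of a tiny in-memory sample to show this. The helper name line_after_seek and the sample bytes are made up for this illustration.

```python
import io
from collections import Counter

def line_after_seek(data, offset):
    """Mimic the seek approach: jump to offset, drop the partial line, return the next."""
    f = io.BytesIO(data)
    f.seek(offset)
    f.readline()         # discard the (possibly partial) line we landed in
    return f.readline()  # b"" when the offset fell inside the last line

# Three lines of 2, 9 and 2 bytes (hypothetical sample data).
data = b"a\nbbbbbbbb\nc\n"
counts = Counter(line_after_seek(data, off) for off in range(len(data)))

if __name__ == "__main__":
    for line, hits in sorted(counts.items()):
        print(repr(line), hits)
```

In this sample the line after the long one is returned for 9 of the 13 offsets, the first line is never returned, and offsets inside the last line yield an empty result. Whether that bias matters depends on how uniform the line lengths are.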

Another Python option, reading the contents of all files into memory:

import random
import fileinput

def openhook(filename, mode):
    f = open(filename, mode)
    headers = [f.readline() for _ in range(5)]
    return f

num_entries = 3
lines = list(fileinput.input(openhook=openhook))
print random.sample(lines, num_entries)
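The snippet above samples individual lines. A variant that keeps each pair together could look like the Python 3 sketch below, assuming every file has a 5-line header followed by complete pairs. The names load_entries and sample_entries are hypothetical.

```python
import random

def load_entries(paths, header_lines=5, entry_size=2):
    """Read every file into memory and group its non-header lines into entries."""
    entries = []
    for path in paths:
        with open(path) as f:
            body = f.readlines()[header_lines:]
        # Step through the body one entry at a time; a trailing
        # incomplete entry is dropped.
        for i in range(0, len(body) - entry_size + 1, entry_size):
            entries.append(tuple(body[i:i + entry_size]))
    return entries

def sample_entries(paths, num_entries, rng=random):
    """Pick up to num_entries distinct entries across all files."""
    entries = load_entries(paths)
    return rng.sample(entries, min(num_entries, len(entries)))

if __name__ == "__main__":
    import sys
    for entry in sample_entries(sys.argv[1:], 3):
        print("".join(entry), end="")
```

Sampling from the combined list gives every entry the same chance regardless of which file it came from. Picking a file first and then an entry would instead favour entries in short files.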
