
How to extract line numbers from multiple files to a single file

I'm working on a project in statistical machine translation in which I have 15 files in a folder (linenumberfiles/). Each file contains multiple line numbers in the following format (one line number per line):

12

15

19

I would like to extract 10 random line numbers from each of the 15 files into a single output file (OutputLinesFile). The tricky part is that a few of the files might contain fewer than 10 line numbers, in which case I'd like to extract as many line numbers as possible to the output file. The format of the output file should be the same as the input files (one line number per line). This is the code I have so far:

import glob
OutputLinesFile = open('OutputLineNumbers', 'w')
inputfiles=glob.glob('linenumberfiles/*')

for file in inputfiles:
    readfile=open(file).readlines()
    OutputLinesFile.write( str(readfile) )
OutputLinesFile.close() 

Has anyone got any ideas how to solve this problem? In advance, thanks for your help!

You can use random.shuffle and list slicing here:

import glob
import random
count = 10      # fetch at most this many lines per file

with open('OutputLineNumbers', 'w') as fout:
   inputfiles=glob.glob('linenumberfiles/*')
   for file in inputfiles:
       with open(file) as f:
           lines = f.readlines()
           random.shuffle(lines)             #shuffle the lines
       fout.writelines(lines[:count]) #pick at most first 10 lines

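or using random.randrange (note that this picks with replacement, so the same line can appear more than once):

lines = f.readlines()
n = min(count, len(lines))   # never ask for more lines than the file has
lines = [lines[random.randrange(0, len(lines))] for _ in range(n)]

and then: fout.writelines(lines)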

First of all, you should use the with statement, which closes the file automatically even if an error occurs while reading. Example:

try:
    with open(file, 'r') as f:
        cont = f.readlines()
except IOError as err:
    print(err)

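Then you should have a look at the random module. To select random items from the list of lines (cont in the example above), use the random.sample function. To check how many lines are in the input file, just use the built-in function len(). Since random.sample raises a ValueError when asked for more items than the list holds, cap the sample size with min(10, len(cont)).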
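
Putting these pieces together, a minimal sketch might look like the following. The helper names sample_lines and collect_samples are made up for illustration, and skipping blank lines is an assumption about the input files rather than something the question states.

```python
import glob
import random


def sample_lines(path, count=10):
    """Return up to `count` randomly chosen lines from the file at `path`."""
    with open(path) as f:
        # Assumption: blank lines are not real line numbers, so skip them.
        lines = [line for line in f if line.strip()]
    # random.sample raises ValueError if asked for more items than exist,
    # so cap the sample size at the number of available lines.
    return random.sample(lines, min(count, len(lines)))


def collect_samples(pattern, output_path, count=10):
    """Write up to `count` random lines from each matching file to output_path."""
    with open(output_path, 'w') as fout:
        for path in sorted(glob.glob(pattern)):
            for line in sample_lines(path, count):
                # Normalize the ending in case a file's last line lacks a newline.
                fout.write(line.rstrip('\n') + '\n')
```

For the layout in the question this would be called as collect_samples('linenumberfiles/*', 'OutputLineNumbers'). Unlike the randrange approach, random.sample picks without replacement, so no line is written twice for the same file.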

