
Python or bash script to pass all files in a folder to the java command line

I have the following Java command line working fine on Mac OS.

java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file.txt > output.txt

Multiple files can be passed as input, separated by spaces, as follows.

java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file1.txt file2.txt > output.txt

Now I have 100 files in a folder, and I have to pass all of these files as input to this command. I used Python's os.system in a for loop over the directory, as follows.

for i, f in enumerate(os.listdir(filedir)):
    os.system('java -cp "stanford-ner.jar" edu.stanford.nlp.process.PTBTokenizer "%s" > "annotate_%s.txt"' % (f, i))

This works fine only for the first file. For all the other outputs like annotate_1, annotate_2, it creates the file but with nothing inside it. I thought of looping over the files and passing each one to subprocess.Popen(), but that seems to be of no help.

Now I am thinking of passing the files one by one in a loop, executing the command sequentially for each file in a bash script. I am also wondering whether I can execute 10 files (at least) in parallel, in different terminals, at a time. Any solution is fine, but I think this question will help me gain some insight into different approaches.

If you want to do this from the shell instead of Python, the xargs tool can almost do everything you want.

You give it a command with a fixed list of arguments, and feed it input with a bunch of filenames, and it'll run the command multiple times, using the same fixed list plus a different batch of filenames from its input. The --max-args option sets the size of the biggest group. If you want to run things in parallel, the --max-procs option lets you do that.

But that's not quite there, because it doesn't do the output redirection. But… do you really need 10 separate files instead of 1 big one? Because if 1 big one is OK, you can just redirect all of them to it:

ls | xargs --max-args=10 --max-procs=10 java -cp stanford-ner.jar \
    edu.stanford.nlp.process.PTBTokenizer >> output.txt

Inside your input file directory, you can do the following in bash:

#!/bin/bash
# Collect the filenames in an array so names with spaces survive intact.
files=()
for file in *.txt
do
    files+=( "$file" )
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer "${files[@]}" > output.txt

If you want to run it as a script, save the file with some name, e.g. my_exec.bash:

#!/bin/bash
if [ $# -ne 2 ]; then
    echo "Invalid input. Enter a directory and an output file"
    exit 1
fi
if [ ! -d "$1" ]; then
    echo "Please pass a valid directory"
    exit 1
fi
# Collect the filenames in an array so names with spaces survive intact.
files=()
for file in "$1"/*.txt
do
    files+=( "$file" )
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer "${files[@]}" > "$2"

Make it executable:

chmod +x my_exec.bash

USAGE:

./my_exec.bash <folder> <output_file>

If you have 100 files, and you want to kick off 10 processes, each handling 10 files, all in parallel, that's easy.

First, you want to group them into chunks of 10. You can do this with slicing or with zipping iterators; in this case, since we definitely have a list, let's just use slicing:

files = os.listdir(filedir)
groups = [files[i:i+10] for i in range(0, len(files), 10)]
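
The "zipping iterators" alternative mentioned above is the grouper idiom from the itertools docs. A minimal sketch, assuming files is the list built above (zip_longest is spelled izip_longest on Python 2):

from itertools import zip_longest  # izip_longest on Python 2

# Ten references to one iterator: zip_longest pulls 10 names at a time
# and pads the last, shorter chunk with None, which we filter back out.
chunks = zip_longest(*[iter(files)] * 10)
groups = [[name for name in chunk if name is not None] for chunk in chunks]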

Now, you want to kick off a process for each group, and then wait for all of the processes, instead of waiting for each one to finish before kicking off the next. This is impossible with os.system, which is one of the many reasons the os.system documentation says "The subprocess module provides more powerful facilities for spawning new processes…"

procs = [subprocess.Popen(…) for group in groups]
for proc in procs:
    proc.wait()

So, what do you pass on the command line to give it 10 filenames instead of 1? If none of the names have spaces or other special characters, you can just ' '.join them. But otherwise, it's a nightmare. Another reason subprocess is better: you can just pass a list of arguments:

procs = [subprocess.Popen(['java', '-cp', 'stanford-ner.jar',
                           'edu.stanford.nlp.process.PTBTokenizer'] + group)
         for group in groups]
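
(For completeness: if you ever did need a single shell string, the safe route is to quote each name first. A minimal sketch, assuming group is one of the 10-file groups built above; shlex.quote is Python 3.3+ and is spelled pipes.quote on Python 2:)

try:
    from shlex import quote  # Python 3.3+
except ImportError:
    from pipes import quote  # Python 2

# Quote every filename so spaces and shell metacharacters survive.
cmd_line = ('java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer '
            + ' '.join(quote(name) for name in group))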

But now how do you get all of the results?

One way is to go back to using a shell command line with the > redirection. But a better way is to do it in Python:

procs = []
files = []
for i, group in enumerate(groups):
    file = open('output_{}'.format(i), 'w')
    files.append(file)
    procs.append(subprocess.Popen([…same as before…], stdout=file))
for proc in procs:
    proc.wait()
for file in files:
    file.close()

(You might want to use a with statement with ExitStack, but I wanted to make sure this didn't require Python 2.7/3.3+, so I used an explicit close.)
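
For reference, here is a sketch of that ExitStack version (Python 3.3+ only), assuming cmd holds the fixed argument list ['java', '-cp', 'stanford-ner.jar', 'edu.stanford.nlp.process.PTBTokenizer'] and groups is the list from above:

import subprocess
from contextlib import ExitStack

with ExitStack() as stack:
    procs = []
    for i, group in enumerate(groups):
        # enter_context registers each file with the stack, so they all
        # get closed on exit, even if an exception is raised mid-loop.
        file = stack.enter_context(open('output_{}'.format(i), 'w'))
        procs.append(subprocess.Popen(cmd + group, stdout=file))
    for proc in procs:
        proc.wait()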

To pass all .txt files in the current directory at once to the java subprocess:

#!/usr/bin/env python
from glob import glob
from subprocess import check_call

cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
with open('output.txt', 'wb', 0) as file:
    check_call(cmd + glob('*.txt'), stdout=file)

It is similar to running the shell command but without running the shell:

$ java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer *.txt > output.txt

To run no more than 10 subprocesses at a time, passing no more than 100 files at a time, you could use multiprocessing.pool.ThreadPool :

#!/usr/bin/env python
from glob import glob
from multiprocessing.pool import ThreadPool
from subprocess import call
try:
    from threading import get_ident # Python 3.3+
except ImportError: # Python 2
    from thread import get_ident

cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
def run_command(files):
    with open('output%d.txt' % get_ident(), 'ab', 0) as file:
        return files, call(cmd + files, stdout=file)

all_files = glob('*.txt')
file_groups = (all_files[i:i+100] for i in range(0, len(all_files), 100))
for _ in ThreadPool(10).imap_unordered(run_command, file_groups):
    pass

It is similar to this xargs command (suggested by @abarnert):

$ ls *.txt | xargs --max-procs=10 --max-args=100 java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer >>output.txt

except that each thread in the Python script writes to its own output file to avoid corrupting the output due to parallel writes.
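
If you do want a single combined file in the end, one option (my addition, not part of the original answer) is to concatenate the per-thread files afterwards; this sketch assumes none of the input files' names start with "output", so the glob pattern matches only the outputs:

import glob
import shutil

# Merge the per-thread output files into one; the order of the chunks
# in the combined file is not preserved.
with open('combined.txt', 'wb') as merged:
    for name in sorted(glob.glob('output*.txt')):
        with open(name, 'rb') as part:
            shutil.copyfileobj(part, merged)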
