[英]Bash multiple files processing
我有一個名為 data_file 的文件,其中包含數據:london paris newyork italy...50 更多項目
有一個包含超過 75 個文件的目錄,例如 dfile1、dfie2...agle75,我正在其中搜索 data_file 中的條目。
files=$(find . -type f)
for f in $files; do
while read -r line; do
found=$(grep $line $f)
if [ ! -z "$found" ]; then
perform task here
fi
done < data_file
done
由於循環逐個運行每個文件,因此需要花費大量時間才能完成。 如何加快速度,我可以同時為多個文件運行 for 循環嗎?
使用 GNU Parallel,您可以執行以下操作:
doit() {
f="$1"
line="$2"
found=$(grep $line $f)
if [ ! -z "$found" ]; then
perform task here
fi
}
export -f doit
find . -type f | parallel doit :::: - data_file
下面的例子是一個成熟的並行執行方法,它處理:
在您的示例中,您的(強化)代碼如下所示:
# Load the ExecTasks function described below (must be in the same directory as this one)
source ./exectasks.sh
directoryToProcess="/my/dir/to/find/stuff/into"
tasklist=""
# Prepare task list separated by semicolumn
while IFS= read -r -d $'\0' file; do
if grep "$line" "$file" > /dev/null 2>&1; then
tasklist="$tasklist""my_task;"
done < <(find "$directoryToProcess" -type f -print0)
# Run tasks
ExecTasks "$tasklist" "trivial-task-id" false 1800 3600 18000 36000 true 1 1800 true false false 8
在這里,我們使用了一個復雜的 function ExecTasks,它將處理並行排隊的任務,讓您保持對正在發生的事情的控制,而不必擔心因為一些掛起的任務而阻塞腳本。
ExecTasks arguments的快速解釋:
"$tasklist" = variable containing task list
"some name" trivial task id (in order to identify in logs)
boolean: read tasks from file (you may have passed a task list from a file if there are too many to fit in a variable
1800 = maximum number of seconds a task may be executed before a warning is raised
3600 = maximum number of seconds a task may be executed before an error is raised and the tasks is stopped
18000 = maximum number of seconds the whole tasks may be executed before a warning is raised
36000 = maximum number of seconds the whole tasks may be executed before an error is raised and all the tasks are stopped
boolean: account execution time since beginning of tasks execution (true) or since script begin
1 = number of seconds between each state check (accepts float like .1)
1800 = Number of seconds between each "i am alive" log just to know everything works as expected
boolean: show spinner (true) or not (false)
boolean: log errors when reaching max times (false) or do not log them (true)
boolean: do not log any errors at all (false) or do log them (true)
And finally
8 = number of simultaneous tasks to launch (8 in our case)
這是 exectasks.sh 的源代碼(您也可以將其直接復制粘貼到腳本 header 而不是source./exectasks.sh
):
function Logger {
# Dummy log function, replace with whatever you need
echo "$2: $1"
}
# Nice cli spinner so we now execution is ongoing
_OFUNCTIONS_SPINNER="|/-\\"
function Spinner {
printf " [%c] \b\b\b\b\b\b" "$_OFUNCTIONS_SPINNER"
_OFUNCTIONS_SPINNER=${_OFUNCTIONS_SPINNER#?}${_OFUNCTIONS_SPINNER%%???}
return 0
}
# Portable child (and grandchild) kill function tester under Linux, BSD and MacOS X
function KillChilds {
local pid="${1}" # Parent pid to kill childs
local self="${2:-false}" # Should parent be killed too ?
# Paranoid checks, we can safely assume that $pid should not be 0 nor 1
if [ $(IsInteger "$pid") -eq 0 ] || [ "$pid" == "" ] || [ "$pid" == "0" ] || [ "$pid" == "1" ]; then
Logger "Bogus pid given [$pid]." "CRITICAL"
return 1
fi
if kill -0 "$pid" > /dev/null 2>&1; then
if children="$(pgrep -P "$pid")"; then
if [[ "$pid" == *"$children"* ]]; then
Logger "Bogus pgrep implementation." "CRITICAL"
children="${children/$pid/}"
fi
for child in $children; do
Logger "Launching KillChilds \"$child\" true" "DEBUG" #__WITH_PARANOIA_DEBUG
KillChilds "$child" true
done
fi
fi
# Try to kill nicely, if not, wait 15 seconds to let Trap actions happen before killing
if [ "$self" == true ]; then
# We need to check for pid again because it may have disappeared after recursive function call
if kill -0 "$pid" > /dev/null 2>&1; then
kill -s TERM "$pid"
Logger "Sent SIGTERM to process [$pid]." "DEBUG"
if [ $? -ne 0 ]; then
sleep 15
Logger "Sending SIGTERM to process [$pid] failed." "DEBUG"
kill -9 "$pid"
if [ $? -ne 0 ]; then
Logger "Sending SIGKILL to process [$pid] failed." "DEBUG"
return 1
fi # Simplify the return 0 logic here
else
return 0
fi
else
return 0
fi
else
return 0
fi
}
function ExecTasks {
# Mandatory arguments
local mainInput="${1}" # Contains list of pids / commands separated by semicolons or filepath to list of pids / commands
# Optional arguments
local id="${2:-base}" # Optional ID in order to identify global variables from this run (only bash variable names, no '-'). Global variables are WAIT_FOR_TASK_COMPLETION_$id and HARD_MAX_EXEC_TIME_REACHED_$id
local readFromFile="${3:-false}" # Is mainInput / auxInput a semicolon separated list (true) or a filepath (false)
local softPerProcessTime="${4:-0}" # Max time (in seconds) a pid or command can run before a warning is logged, unless set to 0
local hardPerProcessTime="${5:-0}" # Max time (in seconds) a pid or command can run before the given command / pid is stopped, unless set to 0
local softMaxTime="${6:-0}" # Max time (in seconds) for the whole function to run before a warning is logged, unless set to 0
local hardMaxTime="${7:-0}" # Max time (in seconds) for the whole function to run before all pids / commands given are stopped, unless set to 0
local counting="${8:-true}" # Should softMaxTime and hardMaxTime be accounted since function begin (true) or since script begin (false)
local sleepTime="${9:-.5}" # Seconds between each state check. The shorter the value, the snappier ExecTasks will be, but as a tradeoff, more cpu power will be used (good values are between .05 and 1)
local keepLogging="${10:-1800}" # Every keepLogging seconds, an alive message is logged. Setting this value to zero disables any alive logging
local spinner="${11:-true}" # Show spinner (true) or do not show anything (false) while running
local noTimeErrorLog="${12:-false}" # Log errors when reaching soft / hard execution times (false) or do not log errors on those triggers (true)
local noErrorLogsAtAll="${13:-false}" # Do not log any errros at all (useful for recursive ExecTasks checks)
# Parallelism specific arguments
local numberOfProcesses="${14:-0}" # Number of simulanteous commands to run, given as mainInput. Set to 0 by default (WaitForTaskCompletion mode). Setting this value enables ParallelExec mode.
local auxInput="${15}" # Contains list of commands separated by semicolons or filepath fo list of commands. Exit code of those commands decide whether main commands will be executed or not
local maxPostponeRetries="${16:-3}" # If a conditional command fails, how many times shall we try to postpone the associated main command. Set this to 0 to disable postponing
local minTimeBetweenRetries="${17:-300}" # Time (in seconds) between postponed command retries
local validExitCodes="${18:-0}" # Semi colon separated list of valid main command exit codes which will not trigger errors
local i
# Expand validExitCodes into array
IFS=';' read -r -a validExitCodes <<< "$validExitCodes"
# ParallelExec specific variables
local auxItemCount=0 # Number of conditional commands
local commandsArray=() # Array containing commands
local commandsConditionArray=() # Array containing conditional commands
local currentCommand # Variable containing currently processed command
local currentCommandCondition # Variable containing currently processed conditional command
local commandsArrayPid=() # Array containing commands indexed by pids
local commandsArrayOutput=() # Array containing command results indexed by pids
local postponedRetryCount=0 # Number of current postponed commands retries
local postponedItemCount=0 # Number of commands that have been postponed (keep at least one in order to check once)
local postponedCounter=0
local isPostponedCommand=false # Is the current command from a postponed file ?
local postponedExecTime=0 # How much time has passed since last postponed condition was checked
local needsPostponing # Does currentCommand need to be postponed
local temp
# Common variables
local pid # Current pid working on
local pidState # State of the process
local mainItemCount=0 # number of given items (pids or commands)
local readFromFile # Should we read pids / commands from a file (true)
local counter=0
local log_ttime=0 # local time instance for comparaison
local seconds_begin=$SECONDS # Seconds since the beginning of the script
local exec_time=0 # Seconds since the beginning of this function
local retval=0 # return value of monitored pid process
local subRetval=0 # return value of condition commands
local errorcount=0 # Number of pids that finished with errors
local pidsArray # Array of currently running pids
local newPidsArray # New array of currently running pids for next iteration
local pidsTimeArray # Array containing execution begin time of pids
local executeCommand # Boolean to check if currentCommand can be executed given a condition
local functionMode
local softAlert=false # Does a soft alert need to be triggered, if yes, send an alert once
local failedPidsList # List containing failed pids with exit code separated by semicolons (eg : 2355:1;4534:2;2354:3)
local randomOutputName # Random filename for command outputs
local currentRunningPids # String of pids running, used for debugging purposes only
# fnver 2019081401
# Initialise global variable
eval "WAIT_FOR_TASK_COMPLETION_$id=\"\""
eval "HARD_MAX_EXEC_TIME_REACHED_$id=false"
# Init function variables depending on mode
if [ $numberOfProcesses -gt 0 ]; then
functionMode=ParallelExec
else
functionMode=WaitForTaskCompletion
fi
if [ $readFromFile == false ]; then
if [ $functionMode == "WaitForTaskCompletion" ]; then
IFS=';' read -r -a pidsArray <<< "$mainInput"
mainItemCount="${#pidsArray[@]}"
else
IFS=';' read -r -a commandsArray <<< "$mainInput"
mainItemCount="${#commandsArray[@]}"
IFS=';' read -r -a commandsConditionArray <<< "$auxInput"
auxItemCount="${#commandsConditionArray[@]}"
fi
else
if [ -f "$mainInput" ]; then
mainItemCount=$(wc -l < "$mainInput")
readFromFile=true
else
Logger "Cannot read main file [$mainInput]." "WARN"
fi
if [ "$auxInput" != "" ]; then
if [ -f "$auxInput" ]; then
auxItemCount=$(wc -l < "$auxInput")
else
Logger "Cannot read aux file [$auxInput]." "WARN"
fi
fi
fi
if [ $functionMode == "WaitForTaskCompletion" ]; then
# Force first while loop condition to be true because we don't deal with counters but pids in WaitForTaskCompletion mode
counter=$mainItemCount
fi
# soft / hard execution time checks that needs to be a subfunction since it is called both from main loop and from parallelExec sub loop
function _ExecTasksTimeCheck {
if [ $spinner == true ]; then
Spinner
fi
if [ $counting == true ]; then
exec_time=$((SECONDS - seconds_begin))
else
exec_time=$SECONDS
fi
if [ $keepLogging -ne 0 ]; then
# This log solely exists for readability purposes before having next set of logs
if [ ${#pidsArray[@]} -eq $numberOfProcesses ] && [ $log_ttime -eq 0 ]; then
log_ttime=$exec_time
Logger "There are $((mainItemCount-counter+postponedItemCount)) / $mainItemCount tasks in the queue of which $postponedItemCount are postponed. Currently, ${#pidsArray[@]} tasks running with pids [$(joinString , ${pidsArray[@]})]." "NOTICE"
fi
if [ $(((exec_time + 1) % keepLogging)) -eq 0 ]; then
if [ $log_ttime -ne $exec_time ]; then # Fix when sleep time lower than 1 second
log_ttime=$exec_time
if [ $functionMode == "WaitForTaskCompletion" ]; then
Logger "Current tasks still running with pids [$(joinString , ${pidsArray[@]})]." "NOTICE"
elif [ $functionMode == "ParallelExec" ]; then
Logger "There are $((mainItemCount-counter+postponedItemCount)) / $mainItemCount tasks in the queue of which $postponedItemCount are postponed. Currently, ${#pidsArray[@]} tasks running with pids [$(joinString , ${pidsArray[@]})]." "NOTICE"
fi
fi
fi
fi
if [ $exec_time -gt $softMaxTime ]; then
if [ "$softAlert" != true ] && [ $softMaxTime -ne 0 ] && [ $noTimeErrorLog != true ]; then
Logger "Max soft execution time [$softMaxTime] exceeded for task [$id] with pids [$(joinString , ${pidsArray[@]})]." "WARN"
softAlert=true
SendAlert true
fi
fi
if [ $exec_time -gt $hardMaxTime ] && [ $hardMaxTime -ne 0 ]; then
if [ $noTimeErrorLog != true ]; then
Logger "Max hard execution time [$hardMaxTime] exceeded for task [$id] with pids [$(joinString , ${pidsArray[@]})]. Stopping task execution." "ERROR"
fi
for pid in "${pidsArray[@]}"; do
KillChilds $pid true
if [ $? -eq 0 ]; then
Logger "Task with pid [$pid] stopped successfully." "NOTICE"
else
if [ $noErrorLogsAtAll != true ]; then
Logger "Could not stop task with pid [$pid]." "ERROR"
fi
fi
errorcount=$((errorcount+1))
done
if [ $noTimeErrorLog != true ]; then
SendAlert true
fi
eval "HARD_MAX_EXEC_TIME_REACHED_$id=true"
if [ $functionMode == "WaitForTaskCompletion" ]; then
return $errorcount
else
return 129
fi
fi
}
function _ExecTasksPidsCheck {
newPidsArray=()
if [ "$currentRunningPids" != "$(joinString " " ${pidsArray[@]})" ]; then
Logger "ExecTask running for pids [$(joinString " " ${pidsArray[@]})]." "DEBUG"
currentRunningPids="$(joinString " " ${pidsArray[@]})"
fi
for pid in "${pidsArray[@]}"; do
if [ $(IsInteger $pid) -eq 1 ]; then
if kill -0 $pid > /dev/null 2>&1; then
# Handle uninterruptible sleep state or zombies by ommiting them from running process array (How to kill that is already dead ? :)
pidState="$(eval $PROCESS_STATE_CMD)"
if [ "$pidState" != "D" ] && [ "$pidState" != "Z" ]; then
# Check if pid hasn't run more than soft/hard perProcessTime
pidsTimeArray[$pid]=$((SECONDS - seconds_begin))
if [ ${pidsTimeArray[$pid]} -gt $softPerProcessTime ]; then
if [ "$softAlert" != true ] && [ $softPerProcessTime -ne 0 ] && [ $noTimeErrorLog != true ]; then
Logger "Max soft execution time [$softPerProcessTime] exceeded for pid [$pid]." "WARN"
if [ "${commandsArrayPid[$pid]}]" != "" ]; then
Logger "Command was [${commandsArrayPid[$pid]}]]." "WARN"
fi
softAlert=true
SendAlert true
fi
fi
if [ ${pidsTimeArray[$pid]} -gt $hardPerProcessTime ] && [ $hardPerProcessTime -ne 0 ]; then
if [ $noTimeErrorLog != true ] && [ $noErrorLogsAtAll != true ]; then
Logger "Max hard execution time [$hardPerProcessTime] exceeded for pid [$pid]. Stopping command execution." "ERROR"
if [ "${commandsArrayPid[$pid]}]" != "" ]; then
Logger "Command was [${commandsArrayPid[$pid]}]]." "WARN"
fi
fi
KillChilds $pid true
if [ $? -eq 0 ]; then
Logger "Command with pid [$pid] stopped successfully." "NOTICE"
else
if [ $noErrorLogsAtAll != true ]; then
Logger "Could not stop command with pid [$pid]." "ERROR"
fi
fi
errorcount=$((errorcount+1))
if [ $noTimeErrorLog != true ]; then
SendAlert true
fi
fi
newPidsArray+=($pid)
fi
else
# pid is dead, get its exit code from wait command
wait $pid
retval=$?
# Check for valid exit codes
if [ $(ArrayContains $retval "${validExitCodes[@]}") -eq 0 ]; then
if [ $noErrorLogsAtAll != true ]; then
Logger "${FUNCNAME[0]} called by [$id] finished monitoring pid [$pid] with exitcode [$retval]." "ERROR"
if [ "$functionMode" == "ParallelExec" ]; then
Logger "Command was [${commandsArrayPid[$pid]}]." "ERROR"
fi
if [ -f "${commandsArrayOutput[$pid]}" ]; then
Logger "Truncated output:\n$(head -c16384 "${commandsArrayOutput[$pid]}")" "ERROR"
fi
fi
errorcount=$((errorcount+1))
# Welcome to variable variable bash hell
if [ "$failedPidsList" == "" ]; then
failedPidsList="$pid:$retval"
else
failedPidsList="$failedPidsList;$pid:$retval"
fi
else
Logger "${FUNCNAME[0]} called by [$id] finished monitoring pid [$pid] with exitcode [$retval]." "DEBUG"
fi
fi
fi
done
# hasPids can be false on last iteration in ParallelExec mode
pidsArray=("${newPidsArray[@]}")
# Trivial wait time for bash to not eat up all CPU
sleep $sleepTime
}
while [ ${#pidsArray[@]} -gt 0 ] || [ $counter -lt $mainItemCount ] || [ $postponedItemCount -ne 0 ]; do
_ExecTasksTimeCheck
retval=$?
if [ $retval -ne 0 ]; then
return $retval;
fi
# The following execution bloc is only needed in ParallelExec mode since WaitForTaskCompletion does not execute commands, but only monitors them
if [ $functionMode == "ParallelExec" ]; then
while [ ${#pidsArray[@]} -lt $numberOfProcesses ] && ([ $counter -lt $mainItemCount ] || [ $postponedItemCount -ne 0 ]); do
_ExecTasksTimeCheck
retval=$?
if [ $retval -ne 0 ]; then
return $retval;
fi
executeCommand=false
isPostponedCommand=false
currentCommand=""
currentCommandCondition=""
needsPostponing=false
if [ $readFromFile == true ]; then
# awk identifies first line as 1 instead of 0 so we need to increase counter
currentCommand=$(awk 'NR == num_line {print; exit}' num_line=$((counter+1)) "$mainInput")
if [ $auxItemCount -ne 0 ]; then
currentCommandCondition=$(awk 'NR == num_line {print; exit}' num_line=$((counter+1)) "$auxInput")
fi
# Check if we need to fetch postponed commands
if [ "$currentCommand" == "" ]; then
currentCommand=$(awk 'NR == num_line {print; exit}' num_line=$((postponedCounter+1)) "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}-postponedMain.$id.$SCRIPT_PID.$TSTAMP")
currentCommandCondition=$(awk 'NR == num_line {print; exit}' num_line=$((postponedCounter+1)) "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}-postponedAux.$id.$SCRIPT_PID.$TSTAMP")
isPostponedCommand=true
fi
else
currentCommand="${commandsArray[$counter]}"
if [ $auxItemCount -ne 0 ]; then
currentCommandCondition="${commandsConditionArray[$counter]}"
fi
if [ "$currentCommand" == "" ]; then
currentCommand="${postponedCommandsArray[$postponedCounter]}"
currentCommandCondition="${postponedCommandsConditionArray[$postponedCounter]}"
isPostponedCommand=true
fi
fi
# Check if we execute postponed commands, or if we delay them
if [ $isPostponedCommand == true ]; then
# Get first value before '@'
postponedExecTime="${currentCommand%%@*}"
postponedExecTime=$((SECONDS-postponedExecTime))
# Get everything after first '@'
temp="${currentCommand#*@}"
# Get first value before '@'
postponedRetryCount="${temp%%@*}"
# Replace currentCommand with actual filtered currentCommand
currentCommand="${temp#*@}"
# Since we read a postponed command, we may decrase postponedItemCounter
postponedItemCount=$((postponedItemCount-1))
#Since we read one line, we need to increase the counter
postponedCounter=$((postponedCounter+1))
else
postponedRetryCount=0
postponedExecTime=0
fi
if ([ $postponedRetryCount -lt $maxPostponeRetries ] && [ $postponedExecTime -ge $minTimeBetweenRetries ]) || [ $isPostponedCommand == false ]; then
if [ "$currentCommandCondition" != "" ]; then
Logger "Checking condition [$currentCommandCondition] for command [$currentCommand]." "DEBUG"
eval "$currentCommandCondition" &
ExecTasks $! "subConditionCheck" false 0 0 1800 3600 true $SLEEP_TIME $KEEP_LOGGING true true true
subRetval=$?
if [ $subRetval -ne 0 ]; then
# is postponing enabled ?
if [ $maxPostponeRetries -gt 0 ]; then
Logger "Condition [$currentCommandCondition] not met for command [$currentCommand]. Exit code [$subRetval]. Postponing command." "NOTICE"
postponedRetryCount=$((postponedRetryCount+1))
if [ $postponedRetryCount -ge $maxPostponeRetries ]; then
Logger "Max retries reached for postponed command [$currentCommand]. Skipping command." "NOTICE"
else
needsPostponing=true
fi
postponedExecTime=0
else
Logger "Condition [$currentCommandCondition] not met for command [$currentCommand]. Exit code [$subRetval]. Ignoring command." "NOTICE"
fi
else
executeCommand=true
fi
else
executeCommand=true
fi
else
needsPostponing=true
fi
if [ $needsPostponing == true ]; then
postponedItemCount=$((postponedItemCount+1))
if [ $readFromFile == true ]; then
echo "$((SECONDS-postponedExecTime))@$postponedRetryCount@$currentCommand" >> "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}-postponedMain.$id.$SCRIPT_PID.$TSTAMP"
echo "$currentCommandCondition" >> "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}-postponedAux.$id.$SCRIPT_PID.$TSTAMP"
else
postponedCommandsArray+=("$((SECONDS-postponedExecTime))@$postponedRetryCount@$currentCommand")
postponedCommandsConditionArray+=("$currentCommandCondition")
fi
fi
if [ $executeCommand == true ]; then
Logger "Running command [$currentCommand]." "DEBUG"
randomOutputName=$(date '+%Y%m%dT%H%M%S').$(PoorMansRandomGenerator 5)
eval "$currentCommand" >> "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}.$id.$pid.$randomOutputName.$SCRIPT_PID.$TSTAMP" 2>&1 &
pid=$!
pidsArray+=($pid)
commandsArrayPid[$pid]="$currentCommand"
commandsArrayOutput[$pid]="$RUN_DIR/$PROGRAM.${FUNCNAME[0]}.$id.$pid.$randomOutputName.$SCRIPT_PID.$TSTAMP"
# Initialize pid execution time array
pidsTimeArray[$pid]=0
else
Logger "Skipping command [$currentCommand]." "DEBUG"
fi
if [ $isPostponedCommand == false ]; then
counter=$((counter+1))
fi
_ExecTasksPidsCheck
done
fi
_ExecTasksPidsCheck
done
# Return exit code if only one process was monitored, else return number of errors
# As we cannot return multiple values, a global variable WAIT_FOR_TASK_COMPLETION contains all pids with their return value
eval "WAIT_FOR_TASK_COMPLETION_$id=\"$failedPidsList\""
if [ $mainItemCount -eq 1 ]; then
return $retval
else
return $errorcount
fi
}
希望你玩得開心。
你可以這樣做:
files=$(find . -type f)
for f in $files; do
while read -r line; do
{
found=$(grep $line $f)
if [ ! -z "$found" ]; then
## perform task here
fi
} &
done < data_file
done
wait
它將在后台執行{}內的塊。 因此,基本上它會打開與您擁有的文件一樣多的后台進程。 如果您想更好地控制實際產生了多少進程,您可以改用parallel 。
您可以使用 grep 的q
選項在第一次匹配后停止搜索,並使用f
選項從文件中獲取模式:
for f in $(find . -type f); do
if $(grep -qf data_file "$f"); then
...
fi
done
如果data_file
包含:
xxx
yyy
zzz
然后grep -qf "$data_file" "$f"
如果在$f
中找到xxx
、 yyy
或zzz
,則計算結果為 true 。
find 命令會減慢速度,並且腳本比它需要的更復雜。
If you want to do this with grep, better to loop through data_file and within that grep $line * > /dev/null && do_something
(or grep -R $line * > /dev/null && do_something
if there are subdirectories to deal with )
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.