简体   繁体   中英

Slow bash script using grep and sed

I'm trying to speed up my script, which currently takes approx 30 seconds. I am a novice to bash, and I am sure I am using some bad scripting practice (found some hint in https://unix.stackexchange.com/a/169765 but still cannot fix my issue).

What I need to do is get data from an external file, and extract numbers into two arrays. My script works fine, except it is too slow.

readData=`cat $myfile` 
# readData = [[1491476100000,60204],[1491476130000,59734],...,[1491476160000,60150]]
# I have approximately 5000 points (two numbers in each point)
pointTime=()
pointVal=()

for line in `echo $readData | grep -Po "[0-9]+,[0-9]+"`; do  
  # Get first number but drop last three zeroes (e.g. 1491476100)
  pointTime+=(`echo $line | grep -Po "^[0-9]+" | sed "s/\(.*\)000$/\1/"`)   
  # Get second number, e.g. 60204
  pointVal+=(`echo $line | grep -Po "[0-9]+$"`)
done 

Maybe I could use some regex inside a parameter expansion, but I don't know how.

Fast Alternative

Here's how I would write the script:

mapfile -t points < <(grep -Po '\d+,\d+' "$myfile")
pointTime=("${points[@]%000,*}")
pointVal=("${points[@]#*,}")

or even

mapfile -t pointTime < <(grep -Po '\d+(?=000,)' "$myfile")
mapfile -t pointVal < <(grep -Po ',\K\d+' "$myfile")

when you are sure that the file is well-formed.

Problems of the old Script

You already identified the main problem: The loop is slow, especially, since a lot of programs are called inside the loop. Nevertheless here are some hints how you could have improved your script without throwing away the loop. Some parts were needlessly complicated, for instance

readData=`cat $myfile` 
`echo $readData | grep -Po "[0-9]+,[0-9]+"`

can be written as

grep -Po "[0-9]+,[0-9]+" "$myfile"

and

echo $line | grep -Po "^[0-9]+" | sed "s/\(.*\)000$/\1/"

can be written as

grep -Po "^[0-9]+(?=000)" <<< "$line"

A big speed boost would be to use bash's matching operator =~ instead of grep , because just starting up grep is slow.

[[ "$line" =~ (.*)000,(.*) ]]
pointTime+=("${BASH_REMATCH[1]}")
pointTime+=("${BASH_REMATCH[2]}")

I'm suspicious of the requirement to store the results in an array. You probably actually want to loop over the values in pairs. In any event, storing intermediate values in memory is inelegant and wasteful.

grep -Eo '[0-9]+,[0-9]+' "$myfile" |
while IFS=, read -r first second, do
    process value pair "${first%000}" "$second"
done

If you insist on storing the values in an array, how to change the body of the loop should be obvious.

    pointTime+=("${first%000}")
    pointVal+=("$second")

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM