Awk: loop & save different lines to different files?

Question

I'm looping over a series of large files with a shell script:

i=0
while read line
do

    # get first char of line
    first=`echo "$line" | head -c 1`

    # make output filename
    name="$first"
    if [ "$first" = "," ]; then
        name='comma'
    fi
    if [ "$first" = "." ]; then
        name='period'
    fi

    # save line to new file
    echo "$line" >> "$2/$name.txt"

    # show live counter and inc
    echo -en "\rLines:\t$i"
    ((i++))

done <$file

The first character in each line will either be alphanumeric, or one of the above defined characters (which is why I'm renaming them for use in the output file name).

It's way too slow.

5,000 lines take 128 seconds.

At this rate I've got a solid month of processing.

Will awk be faster here?

If so, how do I fit the logic into awk?

Answer 1

This can certainly be done more efficiently in bash.

To give you an example: echo foo | head does a fork() call, creates a subshell, sets up a pipeline, starts the external head program... and there's no reason for it at all.

If you want the first character of a line, without any inefficient mucking with subprocesses, it's as simple as this:

c=${line:0:1}

I would also seriously consider sorting your input, so that you re-open the output file only when a new first character is seen, rather than every time through the loop.

That is -- preprocess with sort (as by replacing <$file with < <(sort "$file") ) and do the following each time through the loop, reopening the output file only conditionally:

if [[ $name != "$current_name" ]] ; then
  current_name="$name"
  exec 4>>"$2/$name.txt" # (re)open the output file on FD 4
fi

...and then append to the open file descriptor:

printf '%s\n' "$line" >&4

(not using echo because it can behave undesirably if your line is, say, -e or -n ).
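Putting the fragments above together, a minimal sketch of the sort-then-reopen loop could look like the following. The function name split_sorted and the decision to skip empty lines are assumptions for illustration, not part of the original script. LC_ALL=C is used so that sort groups lines strictly by their first byte; many locales ignore punctuation when sorting, which would scatter the comma and period lines. Note that sorting also changes the order of lines within each output file.

```bash
#!/usr/bin/env bash
# Sketch: sort the input, then reopen FD 4 only when the target file changes.
# split_sorted is a hypothetical name; arguments are input file and output dir.
split_sorted() {
    local infile=$1 outdir=$2 line first name current=
    while IFS= read -r line; do
        [[ -n $line ]] || continue       # assumption: empty lines are skipped
        first=${line:0:1}
        case $first in
            ,) name=comma ;;
            .) name=period ;;
            *) name=$first ;;
        esac
        if [[ $name != "$current" ]]; then
            current=$name
            exec 4>>"$outdir/$name.txt"  # reopen FD 4 on the new output file
        fi
        printf '%s\n' "$line" >&4
    done < <(LC_ALL=C sort -- "$infile")
    exec 4>&-                            # close FD 4 when done
}
```

It would be called as split_sorted "$file" "$2" in place of the original while loop.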

Alternately, if the number of possible output files is small, you can just open them all on different FDs up-front (substituting other, higher numbers where I chose 4), and conditionally output to one of those pre-opened files. Opening and closing files is expensive: every >> redirection inside the loop costs an open() and a close() call per line, so this should be a substantial help.
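As a rough sketch of the keep-the-files-open idea, the variant below opens each output file lazily the first time its character is seen, instead of hard-coding FD numbers up-front. It assumes bash 4.1 or later (for associative arrays and the {var}>> form, which lets bash pick a free descriptor). The function name split_by_first_char is hypothetical, and skipping empty lines is an assumption, since an empty string cannot be used as an associative-array key.

```bash
#!/usr/bin/env bash
# Sketch: one append-mode file descriptor per first character, kept open
# for the whole run. Requires bash 4.1+.
split_by_first_char() {
    local infile=$1 outdir=$2 line first name fd
    local -A fds=()                      # first character -> open FD number
    while IFS= read -r line; do
        [[ -n $line ]] || continue       # assumption: empty lines are skipped
        first=${line:0:1}
        if [[ -z ${fds[$first]:-} ]]; then
            case $first in
                ,) name=comma ;;
                .) name=period ;;
                *) name=$first ;;
            esac
            exec {fd}>>"$outdir/$name.txt"   # bash allocates a free FD
            fds[$first]=$fd
        fi
        printf '%s\n' "$line" >&"${fds[$first]}"
    done < "$infile"
    for fd in "${fds[@]}"; do
        exec {fd}>&-                     # close every descriptor we opened
    done
}
```

Unlike the sorted approach, this keeps the original line order within each output file, at the cost of one open descriptor per distinct first character.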

Answer 2

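#!/usr/bin/awk -f
BEGIN {
    punctlist = ", . ? ! - '"
    pnamelist = "comma period question_mark exclamation_mark hyphen apostrophe"
    pcount = split(punctlist, puncts)
    ncount = split(pnamelist, pnames)
    if (pcount != ncount) {
        print "error: counts don't match, pcount:", pcount, "ncount:", ncount
        exit 1
    }
    for (i = 1; i <= pcount; i++) {
        punct_lookup[puncts[i]] = pnames[i]
    }
}
{
    first = substr($0, 1, 1)
    # fall back to the character itself when it has no entry in the lookup table
    name = (first in punct_lookup) ? punct_lookup[first] : first
    # parenthesize the target: an unparenthesized concatenation after > is not portable
    print > (name ".txt")
    # use NR for the counter; i still holds pcount + 1 from the BEGIN loop
    printf "\r%6d", NR
}
END {
    printf "\n"
}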

The BEGIN block builds an associative array so you can do punct_lookup[","] and get "comma".

The main block simply does the lookups for the filenames and outputs the line to the file. In AWK, > truncates the file the first time and appends subsequently. If you have existing files that you don't want truncated, then change it to >> (but don't use >> otherwise).
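The difference between the two operators can be checked with a small throwaway demo. This is only an illustration; the function name demo_awk_redirect and the temporary file are made up for the example. The file starts out containing the line "old", and awk then writes two lines to it with either > or >>.

```bash
#!/usr/bin/env bash
# Demo of awk's > and >> on a file that already has content.
# demo_awk_redirect is a hypothetical helper; pass "truncate" or "append".
demo_awk_redirect() {
    local op=$1 dir
    dir=$(mktemp -d)
    printf 'old\n' > "$dir/out.txt"      # pre-existing content
    if [[ $op == truncate ]]; then
        # > truncates on the first print, later prints go to the same open file
        printf 'one\ntwo\n' | awk -v f="$dir/out.txt" '{ print > f }'
    else
        # >> keeps the existing content and appends after it
        printf 'one\ntwo\n' | awk -v f="$dir/out.txt" '{ print >> f }'
    fi
    cat "$dir/out.txt"
    rm -rf "$dir"
}
```

With truncate the file ends up holding "one" and "two" (the old line is gone, but the second print did not wipe the first); with append it holds "old", "one" and "two".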

Answer 3

A few things to speed it up:

Don't use echo/head to get the first character. You're spawning at least two additional processes per line. Instead, use bash's parameter expansion facilities to get the first character.

Use if-elif to avoid checking $first against all the possibilities each time. Even better, if you are using bash 4.0 or later, use an associative array to store the output file names, rather than checking against $first in a big if-statement for each line.

If you don't have a version of bash that supports associative arrays, replace your if statements with the following.
```
if [[ "$first" = "," ]]; then
    name='comma'
elif [[ "$first" = "." ]]; then
    name='period'
else
    name="$first"
fi
```

The associative-array version below is preferable, though. Note that read stores the line in $REPLY when no variable name is given (just FYI).

declare -A output
output[","]=comma
output["."]=period
output["?"]=question_mark
output["!"]=exclamation_mark
output["-"]=hyphen
output["'"]=apostrophe
i=0
while IFS= read -r
do

    # get first char of line
    first=${REPLY:0:1}

    # make output filename
    name=${output[$first]:-$first}

    # save line to new file
    printf '%s\n' "$REPLY" >> "$name.txt"

    # show live counter and inc
    echo -en "\r$i"
    ((i++))

done < "$file"

Answer 4

Yet another take:

declare -i i=0
declare -A names
while IFS= read -r line; do
    first=${line:0:1}
    if [[ -z ${names[$first]} ]]; then
        case $first in
            ,) names[$first]="$2/comma.txt" ;;
            .) names[$first]="$2/period.txt" ;;
            *) names[$first]="$2/$first.txt" ;;
        esac
    fi
    printf "%s\n" "$line" >> "${names[$first]}"
    printf "\rLine $((++i))"
done < "$file"

and the awk equivalent:

awk -v dir="$2" '
    {
        first = substr($0,1,1)
        if (! (first in names)) {
            if (first == ",")      names[first] = dir "/comma.txt"
            else if (first == ".") names[first] = dir "/period.txt"
            else                   names[first] = dir "/" first ".txt"
        }
        print > names[first]
        printf("\rLine %d", NR)
    }
' "$file"
