Comparing strings for alphabetical order in Bash, test vs. double bracket syntax

I am working on a Bash scripting project in which I need to delete one of two files if they have identical content. I should delete the one that comes last in an alphabetical sort. In the example output my professor has provided, apple.dat is deleted when the choices are apple.dat and Apple.dat.

if [[ "apple" > "Apple" ]]; then
    echo apple
else
    echo Apple
fi

prints Apple

echo $(echo -e "Apple\napple" | sort | tail -n1)

prints Apple

The ASCII value of a is 97 and that of A is 65, so why does the test say that A is greater?

The weird thing is that I get opposite results with the older syntax:

if [ "apple" \> "Apple" ]; then
    echo apple
else
    echo Apple
fi

prints apple

and if we try to use the \> in the [[ ]] syntax, it is a syntax error.

How can we correct this for the double bracket syntax? I have tested this on the school Debian server, my local machine, and my Digital Ocean droplet server. On my local Ubuntu 20.04 machine and on the school server I get the output described above. Interestingly, on my Digital Ocean droplet, which is also an Ubuntu 20.04 server, I get "apple" with both the double and the single bracket syntax. We are allowed to use either syntax, the double bracket or the single bracket (an actual test call). However, I prefer the newer double bracket syntax and would rather learn how to make this work than convert my mostly finished script to the older, more POSIX-compliant syntax.

Hints:

$ (LC_COLLATE=C; if [ "apple" \> "Apple" ]; then echo apple; else echo Apple; fi)
apple
$ (LC_COLLATE=en_US; if [ "apple" \> "Apple" ]; then echo apple; else echo Apple; fi)
apple

but:

$ (LC_COLLATE=C; if [[ "apple" > "Apple" ]]; then echo apple; else echo Apple; fi)
apple
$ (LC_COLLATE=en_US; if [[ "apple" > "Apple" ]]; then echo apple; else echo Apple; fi)
Apple

The difference is that the Bash-specific [[ ]] test compares strings using the current locale's collation rules, whereas the POSIX test [ ] compares them by ASCII value.

From bash man page:

When used with [[, the < and > operators sort lexicographically using the current locale.

When used with test or [, the < and > operators sort lexicographically using ASCII ordering.
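Since [[ ]] follows the current locale, one way to keep the double bracket syntax and still get ASCII ordering is to switch to the C locale just for the comparison. The following is a minimal sketch under that assumption, not a definitive fix; last_in_c_order is a hypothetical helper name that does not appear in the question or the answers.

```bash
#!/usr/bin/env bash

# Hypothetical helper (an assumption for illustration): prints whichever of
# its two arguments sorts last in byte (ASCII) order.
# The function body is a subshell, so the locale change does not leak out.
last_in_c_order() (
    LC_ALL=C    # LC_ALL takes precedence over LC_COLLATE and LANG
    if [[ "$1" > "$2" ]]; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "$2"
    fi
)

last_in_c_order "apple" "Apple"           # prints apple
last_in_c_order "Apple.dat" "apple.dat"   # prints apple.dat
```

The subshell mirrors the parenthesised one-liners in the hints above. Setting LC_COLLATE=C alone has the same effect only when LC_ALL is not already set in the environment.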

I have come up with my own solution to the problem, however I must first thank @GordonDavisson and @LéaGris for their help and for what I have learned from them as that is invaluable to me.

No matter if computer or human locale is used, if, in an alphabetical sort, apple comes after Apple, then it also comes after Banana and if Banana comes after apple, then Apple comes after apple. So I have come up with the following:

# A function which sorts two words alphabetically with lower case coming after upper case.
# The last word in the sort will be printed twice to demonstrate that this works for both
# the POSIX compliant single bracket test call and the newer double bracket condition
# syntax.
# arg 1: One of two words to sort
# arg 2: One of two words to sort
# Return: 0 upon completion, 1 if incorrect number of args is given
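sort_alphabetically() {
    [ $# -ne 2 ] && return 1

    # Keep the working variables local to the function.
    local word_1_val=0 word_2_val=0 letter

    # Sum the character codes of the first word, one character at a time.
    while IFS= read -r -n1 letter; do
        (( word_1_val += $(printf '%d' "'$letter") ))
    done < <(echo -n "$1")

    # Sum the character codes of the second word.
    while IFS= read -r -n1 letter; do
        (( word_2_val += $(printf '%d' "'$letter") ))
    done < <(echo -n "$2")

    # POSIX-compliant single bracket test.
    if [ "$word_1_val" -gt "$word_2_val" ]; then
        echo "$1"
    else
        echo "$2"
    fi

    # Bash double bracket conditional.
    if [[ $word_1_val -gt $word_2_val ]]; then
        echo "$1"
    else
        echo "$2"
    fi

    return 0
}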

sort_alphabetically "apple" "Apple"
sort_alphabetically "Banana" "apple"
sort_alphabetically "aPPle" "applE"

prints:

apple
apple
Banana
Banana
applE
applE

This works by using process substitution to feed the string to the while loop, which reads it one character at a time; printf then yields the decimal ASCII value of each character. It is like creating a temporary file from the string, which is destroyed automatically, and then reading that file one character at a time. The -n option stops echo from appending a trailing newline, so no extra \n character is fed to the loop.
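The character-by-character read can be tried in isolation. The sketch below only illustrates the mechanism; char_codes is a hypothetical helper name, and printf '%s' is used in place of echo -n purely as a stylistic assumption.

```bash
#!/usr/bin/env bash

# Hypothetical helper (an assumption for illustration): prints the decimal
# code of every character in its argument, separated by spaces.
char_codes() {
    local char codes=()
    # IFS= and -r keep read from trimming whitespace or eating backslashes.
    while IFS= read -r -n1 char; do
        # A leading single quote makes printf print the character's code.
        codes+=("$(printf '%d' "'$char")")
    done < <(printf '%s' "$1")
    echo "${codes[*]}"
}

char_codes "Ab"   # prints 65 98
```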

From bash man pages:

Process Substitution

Process substitution allows a process's input or output to be referred to using a filename. It takes the form of <(list) or >(list). The process list is run asynchronously, and its input or output appears as a filename. This filename is passed as an argument to the current command as the result of the expansion. If the >(list) form is used, writing to the file will provide input for list. If the <(list) form is used, the file passed as an argument should be read to obtain the output of list. Process substitution is supported on systems that support named pipes (FIFOs) or the /dev/fd method of naming open files.

When available, process substitution is performed simultaneously with parameter and variable expansion, command substitution, and arithmetic expansion.

From a Stack Overflow post about printf:

If the leading character is a single-quote or double-quote, the value shall be the numeric value in the underlying codeset of the character following the single-quote or double-quote.

Note: process substitution is not POSIX compliant, but it is supported by Bash in the way stated in the bash man page.


UPDATE: The above does not work in all cases!


The above solution works in many cases; however, it produces some anomalies.

first word   second word   last alphabetically   correct?
apple        Apple         apple                 correct
Apple        apple         apple                 correct
apPLE        Apple         Apple                 incorrect
apple        Banana        Banana                correct
apple        BANANA        apple                 incorrect
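The anomalies follow from the fact that a sum of character codes ignores the position of each character, so it cannot represent alphabetical order. A small sketch that makes this visible; code_sum is a hypothetical helper name, written to use the same summing loop as the function above.

```bash
#!/usr/bin/env bash

# Hypothetical helper (an assumption for illustration): prints the sum of
# the character codes of its argument.
code_sum() {
    local char sum=0
    while IFS= read -r -n1 char; do
        (( sum += $(printf '%d' "'$char") ))
    done < <(printf '%s' "$1")
    echo "$sum"
}

code_sum "apPLE"   # prints 434
code_sum "Apple"   # prints 498, so Apple wins although it starts with A
code_sum "ab"      # prints 195
code_sum "ba"      # prints 195, the same sum for a different word
```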

The following solution gets the results that are needed:

#!/bin/bash

sort_alphabetically() {
    [ $# -ne 2 ] && return 1

    local WORD_1="$1"
    local WORD_2="$2"
    local WORD_1_LOWERED="$(echo -n "$1" | tr '[:upper:]' '[:lower:]')"
    local WORD_2_LOWERED="$(echo -n "$2" | tr '[:upper:]' '[:lower:]')"

    if [ "$(echo -e "$WORD_1\n$WORD_2" | sort | tail -n1)" = "$WORD_1" ] ||\
       [ "$(echo -e "$WORD_1_LOWERED\n$WORD_2_LOWERED" | sort | tail -n1)" =\
         "$WORD_1_LOWERED" ]; then

        if [ "$WORD_1_LOWERED" = "$WORD_2_LOWERED" ]; then

            local ASCII_VAL_WORD_1=0
            local ASCII_VAL_WORD_2=0
            local FIRST_CHAR_1 FIRST_CHAR_2 character
            read -n1 FIRST_CHAR_1 < <(echo -n "$WORD_1")
            read -n1 FIRST_CHAR_2 < <(echo -n "$WORD_2")

            while read -n1 character; do
                (( ASCII_VAL_WORD_1 += $(printf '%d' "'$character") ))
            done < <(echo -n "$WORD_1")

            while read -n1 character; do
                (( ASCII_VAL_WORD_2 += $(printf '%d' "'$character") ))
            done < <(echo -n "$WORD_2")
            
            if [ $ASCII_VAL_WORD_1 -gt $ASCII_VAL_WORD_2 ] &&\
               [ "$FIRST_CHAR_1" \> "$FIRST_CHAR_2" ]; then

                echo "$WORD_1"
            elif [ $ASCII_VAL_WORD_2 -gt $ASCII_VAL_WORD_1 ] &&\
                 [ "$FIRST_CHAR_2" \> "$FIRST_CHAR_1" ]; then

                echo "$WORD_2"
            elif [ "$FIRST_CHAR_1" \> "$FIRST_CHAR_2" ]; then
                echo "$WORD_1"
            else
                echo "$WORD_2"
            fi
        else
            echo "$WORD_1"
        fi
    else
        echo "$WORD_2"
    fi

    return 0
}

sort_alphabetically "apple" "Apple"
sort_alphabetically "Apple" "apple"
sort_alphabetically "apPLE" "Apple"
sort_alphabetically "Apple" "apPLE"
sort_alphabetically "apple" "Banana"
sort_alphabetically "apple" "BANANA"

exit 0

prints:

apple
apple
apPLE
apPLE
Banana
BANANA
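The six results above look like a case-insensitive comparison in which ties are broken by ASCII order, so that lower case sorts after upper case. Under that reading, a shorter sketch gives the same six outputs. It is not claimed to be equivalent to the function above for every input; last_alphabetically is a hypothetical name, and the ${var,,} lower-casing expansion assumes Bash 4 or later.

```bash
#!/usr/bin/env bash

# Hypothetical helper (an assumption for illustration): prints whichever
# argument sorts last, ignoring case first and falling back to byte order.
# The body is a subshell, so LC_ALL=C and the variables stay contained.
last_alphabetically() (
    LC_ALL=C
    a=${1,,}    # lower-cased copies (Bash 4+)
    b=${2,,}
    if [[ "$a" > "$b" ]] || { [[ "$a" == "$b" ]] && [[ "$1" > "$2" ]]; }; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "$2"
    fi
)

last_alphabetically "apple" "Apple"    # prints apple
last_alphabetically "apPLE" "Apple"    # prints apPLE
last_alphabetically "apple" "BANANA"   # prints BANANA
```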

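Changing the operator to -gt does not help. Inside [[ ]], -gt is an arithmetic comparison, so [[ "Apple" -gt "apple" ]] treats both words as variable names, evaluates each unset variable as 0, and is therefore false whichever way round the words are written. To make > inside [[ ]] follow ASCII order, set LC_COLLATE=C (or LC_ALL=C, which overrides it) before the comparison.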
