
Does sort -n handle ties predictably when the --stable option is NOT provided? If it does, how?

Here it looks like the space after the 3 in both rows breaks the numerical sorting and lets the alphabetic sorting kick in, so that 11 < 2:

$ echo -e '3 2\n3 11' | sort -n
3 11
3 2

In man sort, I read:

 -s, --stable stabilize sort by disabling last-resort comparison

which implies that without -s a last-resort comparison is done (between ties, because -s does not affect non-ties).

So the question is: how is this last-resort comparison accomplished? A reference to the source code would be welcome, if necessary to answer the question.

This answer on Unix deduces, from experimentation, that the sorting of ties is lexicographic.

Does the standard/POSIX say anything about this?

Here it looks like the space after the 3 in both rows breaks the numerical sorting and lets the alphabetic sorting kick in

sort -n is not sort -n -k1,1 -k2,2. sort -n interprets the whole line (not individual fields) as a number, much like atoi("3 11") gives 3. Those numbers are then compared. Because both "3 11" and "3 2" evaluate to the number 3, the two lines tie, and the last-resort comparison kicks in.
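The difference between the two invocations can be checked directly. This is a minimal sketch, assuming a POSIX-like sort; LC_ALL=C is set only so that the tie-break is byte-wise and reproducible, and the variable names are illustrative.

```shell
# Whole line as one numeric key: both lines evaluate to 3, so they tie
# and the last-resort comparison decides the order.
whole_line=$(printf '3 2\n3 11\n' | LC_ALL=C sort -n)

# Two separate numeric keys: field 1 ties, then field 2 compares 2 < 11.
per_field=$(printf '3 2\n3 11\n' | LC_ALL=C sort -n -k1,1 -k2,2)

printf 'sort -n:\n%s\n' "$whole_line"
printf 'sort -n -k1,1 -k2,2:\n%s\n' "$per_field"
```

The first form prints 3 11 before 3 2, the second prints 3 2 before 3 11.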

how is this last-resort comparison accomplished?

The idea is that the whole lines are compared as if by strcmp or similar (i.e. strcoll). Because 1 comes before 2, strcmp("3 11", "3 2") is negative and 3 11 sorts first. Ordering options are not taken into account here; in particular, -n is ignored.
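To see the last-resort comparison switch on and off, the same input can be sorted with and without -s. This is a sketch assuming a sort that supports -s (GNU and BSD do); LC_ALL=C and the variable names are choices made for this example.

```shell
# Without -s: the tie between the two "3 ..." lines is broken by comparing
# whole lines, and "3 11" sorts before "3 2" because '1' < '2'.
default_order=$(printf '3 2\n3 11\n1 z\n' | LC_ALL=C sort -n)

# With -s: the last-resort comparison is skipped, so tied lines keep
# their input order. The non-tied line "1 z" is unaffected.
stable_order=$(printf '3 2\n3 11\n1 z\n' | LC_ALL=C sort -s -n)

printf 'default:\n%s\n' "$default_order"
printf 'stable:\n%s\n' "$stable_order"
```

Both runs put 1 z first; only the relative order of the two tied lines changes.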

A reference to the source code would be welcome, if necessary to answer the question.

In GNU sort it is actually xmemcoll0, which takes collation into account; see coreutils sort.c#L2653 in compare (struct line const *a, struct line const *b). There is a memcmp fallback for when LC_COLLATE is not set.
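The byte-wise fallback is easy to observe from the command line. The sketch below forces the C locale with LC_ALL=C (an assumption made for reproducibility); under another locale the collation-aware path may order the same lines differently.

```shell
# All four lines have the numeric value 1, so only the last-resort
# comparison orders them. Under LC_ALL=C that comparison is byte-wise,
# so uppercase ASCII letters sort before lowercase ones.
c_locale=$(printf '1 b\n1 A\n1 a\n1 B\n' | LC_ALL=C sort -n)
printf '%s\n' "$c_locale"
```

This prints 1 A, 1 B, 1 a, 1 b, which is plain byte order.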

In OpenBSD sort it is somewhere around openbsd/sort/coll.c#L528 in str_list_coll(struct bwstring *str1, struct sort_list_item **ss2), and also in list_coll_offset(), where top_level_str_coll is called if all keys compare equal; that function just compares the whole lines.

Does the standard/POSIX say anything about this?

If "this" refers to stable sort and last-resort comparison, then sure. Let's copy the whole paragraph from POSIX sort (emphasis mine):

Comparisons shall be based on one or more sort keys extracted from each line of input (or, if no sort keys are specified, the entire line up to, but not including, the terminating <newline>), and shall be performed using the collating sequence of the current locale. If this collating sequence does not have a total ordering of all characters (see XBD LC_COLLATE), any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

and

Implementations are encouraged to perform the recommended further byte-by-byte comparison of lines that collate equally, even though this may affect efficiency. The impact on efficiency can be mitigated by only performing the additional comparison if the current locale's collating sequence does not have a total ordering of all characters (if the implementation provides a way to query this) or by only performing the additional comparison if the locale name associated with the LC_COLLATE category has an '@' modifier in the name (since locales without an '@' modifier should have a total ordering of all characters - see XBD LC_COLLATE). Note that if the implementation provides a stable sort option as an extension (usually -s), the additional comparison should not be performed when this option has been specified.

Question: How is the last resort comparison done?

This is quickly answered in documentation of GNU coreutils:

A pair of lines is compared as follows: sort compares each pair of fields (see --key), in the order specified on the command line, according to the associated ordering options, until a difference is found or no fields are left. If no key fields are specified, sort uses a default key of the entire line. Finally, as a last resort when all keys compare equal, sort compares entire lines as if no ordering options other than --reverse (-r) were specified. The --stable (-s) option disables this last-resort comparison so that lines in which all fields compare equal are left in their original relative order. The --unique (-u) option also disables the last-resort comparison.

Unless otherwise specified, all comparisons use the character collating sequence specified by the LC_COLLATE locale.

source: Sort Invocation GNU Coreutils

This means that the last resort sorts according to the collating order of LC_COLLATE, i.e. lexicographically (mostly).
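The quoted documentation also says that -r still applies to the last-resort comparison and that -u disables it. A small sketch of both, assuming GNU sort (the behaviour of -r here is taken from the GNU text above and may differ elsewhere); LC_ALL=C and the variable names are choices made for this example.

```shell
# -r also reverses the last-resort comparison: the numeric tie is broken
# by a reversed whole-line comparison, so "3 2" comes before "3 11"
# even though the input has them the other way round.
reversed=$(printf '3 11\n3 2\n' | LC_ALL=C sort -n -r)

# -u treats lines whose keys compare equal as duplicates, so only one
# of the two "3 ..." lines is printed.
unique_count=$(printf '3 2\n3 11\n' | LC_ALL=C sort -n -u | wc -l | tr -d ' ')

printf '%s\n' "$reversed"
printf 'lines kept by -u: %s\n' "$unique_count"
```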

POSIX, on the other hand, adds a final, stricter tie-break after that last resort:

If this collating sequence does not have a total ordering of all characters (see XBD LC_COLLATE), any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

source: Sort POSIX standard

I am not certain whether this is implemented in GNU sort, since it is not a requirement. Nonetheless, POSIX strongly recommends it (see the last paragraph of the Rationale).

What does this mean in case of the OP?

There is a common misunderstanding about how keys are defined. Assume you do something like

$ sort --option -k1,3 file

It is often understood that sort will first sort on field 1, then field 2 and finally field 3, using --option. This is incorrect. The key is the single substring that spans fields 1 through 3. When two lines collate equally on that key, sort performs the last-resort comparison (see earlier):

aa  bb cc xxxxxxxx
---------           <<< rule1: according to the key
------------------  <<< rule2: lexicographical sort (last resort)
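The difference between one spanning key and several single-field keys can be made visible with a small example. This is a sketch under assumptions: the comma separator (-t,), the sample lines and the variable names are invented for illustration, and LC_ALL=C makes the comparison byte-wise.

```shell
# One key spanning fields 1-2: the compared strings are "a,z" and "a+,b".
# Byte-wise, '+' (0x2B) sorts before ',' (0x2C), so "a+,b" comes first.
span_key=$(printf 'a,z\na+,b\n' | LC_ALL=C sort -t, -k1,2)

# Two keys: field 1 is "a" versus "a+", and the shorter prefix sorts
# first, so "a,z" comes first. Field 2 is never consulted.
split_keys=$(printf 'a,z\na+,b\n' | LC_ALL=C sort -t, -k1,1 -k2,2)

printf -- '-k1,2:\n%s\n' "$span_key"
printf -- '-k1,1 -k2,2:\n%s\n' "$split_keys"
```

The two commands give opposite orders for the same input, because with -k1,2 the separator is part of the compared string.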

Using GNU sort, you can see which substring is used for the sort. This is done with the --debug option. Here you see the difference between 3 simple cases:

# Sort lexicographically with full line
# -------------------------------------------------------------------
$ echo -e "ab c d\nefg h i" | sort --debug
sort: using ‘en_GB.UTF-8’ sorting rules
ab c d
______
efg h i
_______
# -------------------------------------------------------------------
# Sort lexicographically with the substring formed by field 1 and 2
# -------------------------------------------------------------------
$ echo -e "ab c d\nefg h i" | sort -k1,2 --debug
sort: using ‘en_GB.UTF-8’ sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
ab c d
____
______
efg h i
_____
_______
# -------------------------------------------------------------------
# Sort lexicographically with field 1 followed by field 2
# -------------------------------------------------------------------
$ echo -e "ab c d\nefg h i" | sort -k1,1 -k2,2 --debug
sort: using ‘en_GB.UTF-8’ sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
sort: leading blanks are significant in key 2; consider also specifying 'b'
ab c d
__
  __
______
efg h i
___
   __
_______

When you do a numeric sort (using -n or -g ), sort will attempt to extract a number from the key (1234abc leads to 1234) and use that number for the sorting.

# Sort numerically with full line
# -------------------------------------------------------------------
$ echo -e "3a 11a\n3b 2b" | sort -n --debug
sort: using ‘en_GB.UTF-8’ sorting rules
3a 11a
_         # numeric on full line
______    # lexicographically on full line  (last resort)
3b 2b
_         # numeric on full line
_____     # lexicographically on full line  (last resort)
# -------------------------------------------------------------------
# Sort numerically with field 1 then field 2
# -------------------------------------------------------------------
$ echo -e "3a 11a\n3b 2b" | sort -n -k1,1 -k2,2 --debug
sort: using ‘en_GB.UTF-8’ sorting rules
3b 2b
_         # numeric on field 1
   _      # numeric on field 2
_____     # lexicographically on full line  (last resort)
3a 11a
_         # numeric on field 1
   __     # numeric on field 2
______    # lexicographically on full line  (last resort)

As you can see in these two cases, even though the first field could be ordered lexicographically (3a < 3b), that ordering is ignored, because only the number is picked from the key.
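The number extraction itself can be checked without --debug. This is a minimal sketch with made-up sample lines; LC_ALL=C is set so that the non-numeric run is byte-wise and reproducible.

```shell
# -n reads the leading number of each line and ignores the trailing text.
numeric=$(printf '10abc\n9xyz\n1234abc\n' | LC_ALL=C sort -n)

# Without -n the same lines are compared as strings, character by character.
lexical=$(printf '10abc\n9xyz\n1234abc\n' | LC_ALL=C sort)

printf 'numeric:\n%s\n' "$numeric"
printf 'lexical:\n%s\n' "$lexical"
```

The numeric run orders the lines as 9, 10, 1234, while the string comparison puts 9xyz last because '9' is greater than '1'.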
