简体   繁体   中英

What is the purpose of sorted sets?

Clojure has a function sorted-set which creates a PersistentTreeSet object. As the name implies, sorted-set creates a sorted collection of unique objects.

When are sorted sets useful? When is it better to use sorted-set than sort and distinct ?

=> (apply sorted-set [2 2 1 1 3 3])
#{1 2 3}
=> (sort (distinct [2 2 1 1 3 3]))
(1 2 3)

Sorted sets are useful when you need set semantics – fast contains? , conj and disj (= element removals), as explained by Leon – and traversals in a well-defined order. In the case of the built-in sorted sets (and maps), ordered traversals are possible over the entire set ( seq , rseq ) and any "subrange" ( subseq , rsubseq ) between two keys, inclusive or exclusive.

If you're willing to reach for out-of-core collections, the Contrib library data.avl (of which I am the author and maintainer) offers a flavour of sorted sets and maps with additional functionality – nth for access to set elements by rank, rank-of for discovering the rank of an element in the set, nearest neighbour queries, and "subrange" and split-like operations that return completely functional subsets of the input collection (think subseq returning a completely functional subset of the original, not just a seq, without holding on to any elements of the original not present in the subset for the purposes of GC). All of these operate in O(log n) time worst-case, just like the standard sorted set operations.

If you only need contains? + conj + disj , you'll probably want to use hash sets instead, since they tend to deliver better performance for these operations. It is worth noting, however, that if you anticitpate adding inputs from a possibly malicious outside source to your sets, you may want to go with sorted sets even if you don't care about the order. This is because hash sets' performance degrades to O(n) in the presence of hash collisions (which an adversary could force, the hash function in use being deterministic and fixed in advance), whereas sorted sets' O(log n) is a hard guarantee.

If you only need to sort your input collection once and then traverse it in whole, or various prefixes/suffixes of it, repeatedly, then building up a sorted vector of unique items may indeed be the better option. A sorted set may still be preferable even for a traversal-only workload, though, if you need the subseq / rsubseq feature of starting at an arbitrary element of the collection ( (subseq a-set >= 5) = seq over those elements of a-set which are >= 5 with respect to a-set 's ordering).

Personally, I use sorted sets if I want an ordered data structure without duplicates as I add elements in. That being said, I start with an empty set rather than apply it to a list.

The time I would use sort and distinct is if I had any other data structure like a list that I want to order and remove duplicates.

Basically, applying a set gives you a new object with unique elements while distinct acts on the same list reference.

The difference between a sorted set and the result of calling sort and distinct is that the resulting type is a set.

This gives you O(log N) performance (think binary search) to check whether an element is in the collection ( contains? ) or to add one ( conj ), while on a list, returned by sort and distinct you'd get worse characteristics to achieve the same behavior by default.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM