
Is pandas sort_values() deterministic in case of ties?

I was wondering whether pandas sorting with sort_values() is a deterministic operation in case of ties, i.e. whether calling df.sort_values('foo') always returns the same ordering, no matter how often I run it. One example would be

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1, 3, 5), columns=["foo"])  # random 1s and 2s, so ties are likely
df.sort_values(['foo'])

    foo
0   1
4   1
1   2
2   2
3   2

I understand that the operation is not stable, but is it deterministic?

Yes. If you use kind='quicksort', the output is deterministic, but not stable.
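One way to check this empirically is to sort the same frame many times and compare the resulting index order. The sketch below is only an illustration: the seed, the frame size, and the number of repetitions are arbitrary choices, and a passing check on one machine and one library version is evidence rather than proof.

```python
import numpy as np
import pandas as pd

# Fixed seed so the example input is reproducible (an arbitrary choice).
rng = np.random.default_rng(0)
df = pd.DataFrame({"foo": rng.integers(1, 3, size=1000)})  # only 1s and 2s: many ties

# Index order produced by the first sort.
first = df.sort_values("foo", kind="quicksort").index.to_numpy()

# Sort the same data again several times and compare the index order each time.
repeats_match = all(
    np.array_equal(first, df.sort_values("foo", kind="quicksort").index.to_numpy())
    for _ in range(20)
)
print(repeats_match)
```

If the sort is deterministic, every repetition yields the same index order and the script prints True.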

Quicksort can be nondeterministic because of how the pivot is chosen. Every quicksort implementation is made up of three steps:

  1. Pick a pivot element.
  2. Divide the list into two lists: the elements smaller than the pivot, and the elements larger than the pivot.
  3. Run quicksort on both parts of the list.

There are three popular ways of implementing step 1.

  1. The first way is to pick a pivot element at an arbitrary but fixed position, such as the first element or the middle element.
  2. The second way is to pick an element at random.
  3. The third way is to pick several elements at random, and compute a median (or median of medians).

The first way is deterministic. The second and third ways are nondeterministic.
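The difference between these strategies can be sketched with a toy quicksort whose pivot rule is passed in as a function. This is not numpy's implementation; the names quicksort, middle_pivot, and random_pivot and the sample records are made up for illustration. Each record is a (key, label) pair and only the key is compared, so the labels reveal the order in which ties end up.

```python
import random

def quicksort(items, key, choose_pivot):
    """Sort items in place, comparing only key(item); not stable."""
    def sort(lo, hi):
        if lo >= hi:
            return
        # Step 1: pick a pivot and park it at the end of the region.
        p = choose_pivot(lo, hi)
        items[p], items[hi] = items[hi], items[p]
        pivot_key = key(items[hi])
        # Step 2: move every element with a smaller key to the left part.
        store = lo
        for i in range(lo, hi):
            if key(items[i]) < pivot_key:
                items[i], items[store] = items[store], items[i]
                store += 1
        items[store], items[hi] = items[hi], items[store]
        # Step 3: recurse into both parts.
        sort(lo, store - 1)
        sort(store + 1, hi)
    sort(0, len(items) - 1)
    return items

def middle_pivot(lo, hi):
    # Fixed rule: always the middle position of the region.
    return lo + ((hi - lo) >> 1)

def random_pivot(lo, hi):
    # Random rule: any position in the region.
    return random.randint(lo, hi)

records = [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (2, "e"), (1, "f")]
get_key = lambda r: r[0]

# Collect the distinct orderings seen over many runs of each strategy.
middle_orders = {tuple(quicksort(list(records), get_key, middle_pivot)) for _ in range(200)}
random_orders = {tuple(quicksort(list(records), get_key, random_pivot)) for _ in range(200)}

print(len(middle_orders))  # 1: the fixed rule gives the same tie order every run
print(len(random_orders))  # typically more than 1: tie order depends on the random pivots
```

Every result is correctly sorted by key; only the relative order of records with equal keys can vary, and only when the pivot is chosen at random.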

So, which kind of quicksort does Pandas implement? Pandas dispatches sort_values() to sort_index(), which uses numpy's argsort() to do the sort. How does numpy pick the pivot? That is defined in numpy's quicksort source file.

The pivot element is vp. It is chosen like so:

/* quicksort partition */
pm = pl + ((pr - pl) >> 1);
[...]
vp = *pm;

How does this work? The variables pl and pr are pointers to the beginning and end of the region to be sorted, respectively. Subtracting the two gives the distance between them in elements, that is, the size of the region. Shifting that right by one bit divides it by 2, and adding the result to pl makes the pm pointer point to an element halfway into the region. Then pm is dereferenced to get the pivot element. (Note that this isn't necessarily the median element of the array. It could be the smallest element, or the largest.)
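The same arithmetic can be mirrored in Python with list indices standing in for the C pointers. The helper name middle_index and the sample data are made up for illustration; only the expression pl + ((pr - pl) >> 1) comes from the snippet above.

```python
def middle_index(pl, pr):
    # Index analogue of the C pointer arithmetic: pm = pl + ((pr - pl) >> 1)
    return pl + ((pr - pl) >> 1)

data = [7, 3, 9, 1, 5]
pl, pr = 0, len(data) - 1  # index of the first and last element of the region
pm = middle_index(pl, pr)
vp = data[pm]              # analogue of vp = *pm
print(pm, vp)  # 2 9
```

Here the pivot is 9, the largest value in the region, while the median is 5. The choice depends only on position, never on the values or on any random state.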

This means that numpy uses the first way to pick the pivot: it is arbitrary but deterministic. The tradeoff is that for some orderings of data, the sort performance will degrade from O(N log N) to O(N^2).

More information about implementing quicksort
