
A Lexicographical Bug in Pandas?

Please take this question lightly, as it is asked out of curiosity:

While trying to see how slicing works on a MultiIndex, I came across the following situation:

import numpy as np
import pandas as pd

# Simple MultiIndex creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])

# Making a Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)

Returns:

a  1    5
   2    0
c  1    8
   2    6
b  1    6
   2    3
dtype: int32

NOTE that the index is not in sorted order: a, c, b is the order, which should produce the expected error when slicing.

# When we do slicing
data.loc["a":"c"]

This raises an error like:

UnsortedIndexError

----> 1 data.loc["a":"c"]
UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'

That's expected. But now, after doing the following steps:

# Making a DataFrame
data = data.unstack()

# Reindexing to unsort the index like before
data = data.reindex(["a", "c", "b"])

# Which looks like 
   1  2
a  5  0
c  8  6
b  6  3

# Then again making series
data = data.stack()

# Reindex Again!
data = data.reindex(["a", "c", "b"], level=0)


# Which looks like before
a  1    5
   2    0
c  1    8
   2    6
b  1    6
   2    3
dtype: int32

The Problem

So, now the process is: Series → Unstack → DataFrame → Stack → Series

Now, if I do the same slicing as before (with the index still unsorted), I don't get any error!

# The same slicing
data.loc["a":"c"]

Results without an error:

a  1    5
   2    0
c  1    8
   2    6
dtype: int32

Even though data.index.is_monotonic is False, why can we still slice?

So the question is: WHY?

I hope the situation is clear. The same Series that raised the error before no longer raises it after the unstack and stack operations.

So is that a bug, or a new concept that I am missing here?

Thanks!
Aayush ∞ Shah

UPDATE: I have used data.reindex() to unsort the index once more. Please have a look at it again.

The difference between your two Series is the following:

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])

data = pd.Series(np.random.randint(10, size=6), index=index)

data2 = data.unstack().reindex(["a", "c", "b"]).stack()

>>> data.index.codes
FrozenList([[0, 0, 2, 2, 1, 1], [0, 1, 0, 1, 0, 1]])

>>> data2.index.codes
FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

Even though your two indexes have the same appearance (values), their internal representations (codes) are different.
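To make the role of the codes concrete, here is a minimal sketch that builds both layouts directly instead of going through unstack/stack. It assumes that label slicing is allowed whenever the integer codes (rather than the labels themselves) are lexicographically sorted, which is what the "lexsort depth" in the error message points to. The variable names are illustrative, and fixed data replaces the random values.

```python
import pandas as pd

# from_product factorizes the labels, so the first level is stored sorted
# as ['a', 'b', 'c'] and the rows a, c, b get the codes 0, 2, 1.
idx_unsorted_codes = pd.MultiIndex.from_product([["a", "c", "b"], [1, 2]])

# Same labels in the same order, but the level is stored as ['a', 'c', 'b'],
# so the codes are 0, 1, 2 (the layout the answer reports for data2).
idx_sorted_codes = pd.MultiIndex(
    levels=[["a", "c", "b"], [1, 2]],
    codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
)

s1 = pd.Series(range(6), index=idx_unsorted_codes)
s2 = pd.Series(range(6), index=idx_sorted_codes)

# Both indexes hold the same labels in the same order.
same_labels = idx_unsorted_codes.equals(idx_sorted_codes)

# Slicing the first Series is rejected because its codes are not sorted.
try:
    s1.loc["a":"c"]
    s1_slice_ok = True
except pd.errors.UnsortedIndexError:
    s1_slice_ok = False

# Slicing the second Series works and follows the stored order a, c, b.
s2_slice = s2.loc["a":"c"]
```

In this sketch the slice on s2 stops at "c" in the stored order, so the "b" rows are left out, which matches the output shown in the question.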

Check the docstring of this MultiIndex method:

        Create a new MultiIndex from the current to monotonically sorted
        items IN the levels. This does not actually make the entire MultiIndex
        monotonic, JUST the levels.

        The resulting MultiIndex will have the same outward
        appearance, meaning the same .values and ordering. It will also
        be .equals() to the original.
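The idea in that docstring (same outward appearance, differently ordered levels) can be reproduced with public API only. The sketch below is not the method quoted above. It simply rebuilds an index from its tuples, on the assumption that the from_tuples constructor re-factorizes the labels and stores sorted levels.

```python
import pandas as pd

# Index whose first level is stored in the order ['a', 'c', 'b'].
idx = pd.MultiIndex(
    levels=[["a", "c", "b"], [1, 2]],
    codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
)

# Rebuilding from the tuples re-factorizes the labels,
# so the levels of the new index are stored sorted.
rebuilt = pd.MultiIndex.from_tuples(idx.tolist())

levels_before = list(idx.levels[0])
levels_after = list(rebuilt.levels[0])
codes_before = list(idx.codes[0])
codes_after = list(rebuilt.codes[0])

# The two indexes still compare equal, since equals() looks at the labels.
still_equal = idx.equals(rebuilt)
```

Comparing levels_before with levels_after, and codes_before with codes_after, shows that only the internal layout changes while the labels and their order stay the same.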

Old answer

# Making a DataFrame
data = data.unstack()

# Which looks like         # <- WRONG
   1  2                    #    1  2
a  5  0                    # a  8  0
c  8  6                    # b  4  1
b  6  3                    # c  7  6

# Then again making series
data = data.stack()

# Which looks like before  # <- WRONG
a  1    5                  # a  1    2
   2    0                  #    2    1
c  1    8                  # b  1    0
   2    6                  #    2    1
b  1    6                  # c  1    3
   2    3                  #    2    9
dtype: int32

If you want to use slicing, you have to check whether the index is monotonic:

# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])

# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)

>>> data.index.is_monotonic
False

>>> data.unstack().stack().index.is_monotonic
True

>>> data.sort_index().index.is_monotonic
True
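One way to act on that advice is to try the slice and fall back to sorting when pandas rejects it. The helper below is a hypothetical sketch (slice_outer_level is not a pandas function), and it uses fixed data instead of random values so the result is reproducible.

```python
import pandas as pd


def slice_outer_level(obj, start, stop):
    """Label-slice the outermost index level, sorting the index first if needed."""
    try:
        return obj.loc[start:stop]
    except pd.errors.UnsortedIndexError:
        # sort_index() returns a lexicographically sorted copy that can be sliced.
        return obj.sort_index().loc[start:stop]


index = pd.MultiIndex.from_product([["a", "c", "b"], [1, 2]])
data = pd.Series(range(6), index=index)

result = slice_outer_level(data, "a", "c")
```

Note that the fallback changes the row order: after sorting, the slice from "a" to "c" also contains the "b" rows, unlike the slice on the unsorted layout shown in the question.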
