Naturally sorting Pandas DataFrame

I have a pandas DataFrame with indices I want to sort naturally. Natsort doesn't seem to work. Sorting the indices prior to building the DataFrame doesn't seem to help, because the manipulations I do to the DataFrame seem to mess up the sorting in the process. Any thoughts on how I can re-sort the indices naturally?

from natsort import natsorted
import pandas as pd

# An unsorted list of strings
a = ['0hr', '128hr', '72hr', '48hr', '96hr']
# Sorted incorrectly
b = sorted(a)
# Naturally Sorted 
c = natsorted(a)

# Use a as the index for a DataFrame
df = pd.DataFrame(index=a)
# Sorted incorrectly (sort_index() replaces df.sort(), which later pandas versions removed)
df2 = df.sort_index()
# Natsort doesn't seem to work
df3 = natsorted(df)

print(a)
print(b)
print(c)
print(df.index)
print(df2.index)
print(df3.index)

Now that pandas supports key in both sort_values and sort_index, you should refer to this other answer and send all upvotes there, as it is now the correct answer.

I will leave my answer here for people stuck on old pandas versions, or as a historical curiosity.


The accepted answer answers the question being asked. I'd also like to add how to use natsort on columns in a DataFrame, since that will be the next question asked.

In [1]: from pandas import DataFrame

In [2]: from natsort import natsorted, index_natsorted, order_by_index

In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])

In [4]: df
Out[4]: 
         a   b
0hr     a5  b1
128hr   a1  b1
72hr   a10  b2
48hr    a2  b2
96hr   a12  b1

As the accepted answer shows, sorting by the index is fairly straightforward:

In [5]: df.reindex(index=natsorted(df.index))
Out[5]: 
         a   b
0hr     a5  b1
48hr    a2  b2
72hr   a10  b2
96hr   a12  b1
128hr   a1  b1

If you want to sort on a column in the same manner, you need to reorder the index by the order in which the desired column would be sorted. natsort provides the convenience functions index_natsorted and order_by_index to do just that.

In [6]: df.reindex(index=order_by_index(df.index, index_natsorted(df.a)))
Out[6]: 
         a   b
128hr   a1  b1
48hr    a2  b2
0hr     a5  b1
72hr   a10  b2
96hr   a12  b1

In [7]: df.reindex(index=order_by_index(df.index, index_natsorted(df.b)))
Out[7]: 
         a   b
0hr     a5  b1
128hr   a1  b1
96hr   a12  b1
72hr   a10  b2
48hr    a2  b2
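
For anyone wondering what those two helpers are doing, here is a rough pure-Python sketch of the idea. The names natural_key, positions_in_natural_order and take_by_position are made up for illustration, and the regex-based key is only an approximation of what natsort does internally, so treat this as a mental model rather than a replacement.

```python
import re

def natural_key(text):
    # Hypothetical helper: split into digit and non-digit runs so that
    # numeric runs compare as integers ('a2' sorts before 'a10').
    return [int(tok) if tok.isdecimal() else tok for tok in re.split(r"(\d+)", text)]

def positions_in_natural_order(values):
    # Rough stand-in for index_natsorted: the positions that would sort `values`.
    return sorted(range(len(values)), key=lambda i: natural_key(values[i]))

def take_by_position(labels, order):
    # Rough stand-in for order_by_index: pick the labels in the given order.
    return [labels[i] for i in order]

index = ["0hr", "128hr", "72hr", "48hr", "96hr"]
column_a = ["a5", "a1", "a10", "a2", "a12"]

order = positions_in_natural_order(column_a)
new_index = take_by_position(index, order)
print(order)      # [1, 3, 0, 2, 4]
print(new_index)  # ['128hr', '48hr', '0hr', '72hr', '96hr']
```

The reordered labels are what gets handed to reindex, which is why the rows come out in the natural order of column "a".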

If you want to reorder by an arbitrary number of columns (or a column and the index), you can use zip (or itertools.izip on Python 2) to specify sorting on multiple columns. The first column given will be the primary sorting column, then secondary, then tertiary, etc.

In [8]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.a))))
Out[8]: 
         a   b
128hr   a1  b1
0hr     a5  b1
96hr   a12  b1
48hr    a2  b2
72hr   a10  b2

In [9]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.index))))
Out[9]: 
         a   b
0hr     a5  b1
96hr   a12  b1
128hr   a1  b1
48hr    a2  b2
72hr   a10  b2
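
The reason zip gives a primary/secondary ordering is that each row becomes a tuple, and tuples compare element by element. A small pure-Python sketch of that, assuming a simplified regex-based key (natural_key is a hypothetical helper, not natsort's implementation):

```python
import re

def natural_key(text):
    # Hypothetical helper: numeric runs compare as integers, the rest as text.
    return [int(tok) if tok.isdecimal() else tok for tok in re.split(r"(\d+)", text)]

index = ["0hr", "128hr", "72hr", "48hr", "96hr"]
column_a = ["a5", "a1", "a10", "a2", "a12"]
column_b = ["b1", "b1", "b2", "b2", "b1"]

# One tuple per row: column "b" first (primary), column "a" second (tie-breaker).
rows = list(zip(column_b, column_a))
order = sorted(range(len(rows)), key=lambda i: tuple(natural_key(v) for v in rows[i]))
new_index = [index[i] for i in order]
print(new_index)  # ['128hr', '0hr', '96hr', '48hr', '72hr']
```

Swapping the arguments to zip swaps which column is primary.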

Here is an alternate method using Categorical objects, which the pandas devs have told me is the "proper" way to do this. This requires (as far as I can see) pandas >= 0.16.0. Currently, it only works on columns, but apparently in pandas >= 0.17.0 they will add CategoricalIndex, which will allow this method to be used on an index.

In [1]: from pandas import DataFrame

In [2]: from natsort import natsorted

In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])

In [4]: df.a = df.a.astype('category')

In [5]: df.a.cat.reorder_categories(natsorted(df.a), inplace=True, ordered=True)

In [6]: df.b = df.b.astype('category')

In [8]: df.b.cat.reorder_categories(natsorted(set(df.b)), inplace=True, ordered=True)

In [9]: df.sort('a')
Out[9]: 
         a   b
128hr   a1  b1
48hr    a2  b2
0hr     a5  b1
72hr   a10  b2
96hr   a12  b1

In [10]: df.sort('b')
Out[10]: 
         a   b
0hr     a5  b1
128hr   a1  b1
96hr   a12  b1
72hr   a10  b2
48hr    a2  b2

In [11]: df.sort(['b', 'a'])
Out[11]: 
         a   b
128hr   a1  b1
0hr     a5  b1
96hr   a12  b1
48hr    a2  b2
72hr   a10  b2

The Categorical object lets you define a sorting order for the DataFrame to use. The elements given when calling reorder_categories must be unique, hence the call to set for column "b".

I leave it to the user to decide if this is better than the reindex method or not, since it requires you to sort the column data independently before sorting within the DataFrame (although I imagine that second sort is rather efficient).


Full disclosure, I am the natsort author.

If you want to sort the df, just sort the index or the data and assign directly to the index of the df, rather than trying to pass the df as an arg, as that yields an empty list:

In [7]:

df.index = natsorted(a)
df.index
Out[7]:
Index(['0hr', '48hr', '72hr', '96hr', '128hr'], dtype='object')

Note that df.index = natsorted(df.index) also works.

If you pass the df as an arg, it yields an empty list, in this case because the df is empty (has no columns); otherwise it will return the columns sorted, which is not what you want:

In [10]:

natsorted(df)
Out[10]:
[]

EDIT

If you want to sort the index so that the data is reordered along with the index, then use reindex:

In [13]:

import numpy as np
df=pd.DataFrame(index=a, data=np.arange(5))
df
Out[13]:
       0
0hr    0
128hr  1
72hr   2
48hr   3
96hr   4

In [14]:

df = df*2
df
Out[14]:
       0
0hr    0
128hr  2
72hr   4
48hr   6
96hr   8

In [15]:

df.reindex(index=natsorted(df.index))
Out[15]:
       0
0hr    0
48hr   6
72hr   4
96hr   8
128hr  2

Note that you have to assign the result of reindex to either a new df or to itself; it does not accept the inplace param.

Using sort_values for pandas >= 1.1.0

With the new key argument in DataFrame.sort_values, available since pandas 1.1.0, we can directly sort a column without setting it as an index, using natsort.natsort_keygen:

import pandas as pd

df = pd.DataFrame({
    "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
    "value": [10, 20, 30, 40, 50]
})

    time  value
0    0hr     10
1  128hr     20
2   72hr     30
3   48hr     40
4   96hr     50

from natsort import natsort_keygen

df.sort_values(
    by="time",
    key=natsort_keygen()
)

    time  value
0    0hr     10
3   48hr     40
2   72hr     30
4   96hr     50
1  128hr     20
