Naturally sorting Pandas DataFrame
I have a pandas DataFrame with indices I want to sort naturally. Natsort doesn't seem to work. Sorting the indices prior to building the DataFrame doesn't seem to help because the manipulations I do to the DataFrame seem to mess up the sorting in the process. Any thoughts on how I can re-sort the indices naturally?
from natsort import natsorted
import pandas as pd
# An unsorted list of strings
a = ['0hr', '128hr', '72hr', '48hr', '96hr']
# Sorted incorrectly
b = sorted(a)
# Naturally Sorted
c = natsorted(a)
# Use a as the index for a DataFrame
df = pd.DataFrame(index=a)
# Sorted incorrectly (DataFrame.sort() was removed in pandas 0.20; sort_index() is the equivalent here)
df2 = df.sort_index()
# Natsort doesn't seem to work
df3 = natsorted(df)
print(a)
print(b)
print(c)
print(df.index)
print(df2.index)
print(df3.index)
Now that pandas has support for key in both sort_values and sort_index, you should refer to this other answer and send all upvotes there, as it is now the correct answer. I will leave my answer here for people stuck on old pandas versions, or as a historical curiosity.
The accepted answer answers the question being asked. I'd also like to add how to use natsort on columns in a DataFrame, since that will be the next question asked.
In [1]: from pandas import DataFrame
In [2]: from natsort import natsorted, index_natsorted, order_by_index
In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])
In [4]: df
Out[4]:
a b
0hr a5 b1
128hr a1 b1
72hr a10 b2
48hr a2 b2
96hr a12 b1
As the accepted answer shows, sorting by the index is fairly straightforward:
In [5]: df.reindex(index=natsorted(df.index))
Out[5]:
a b
0hr a5 b1
48hr a2 b2
72hr a10 b2
96hr a12 b1
128hr a1 b1
If you want to sort on a column in the same manner, you need to reorder the index using the order in which the desired column would be naturally sorted. natsort provides the convenience functions index_natsorted and order_by_index to do just that.
In [6]: df.reindex(index=order_by_index(df.index, index_natsorted(df.a)))
Out[6]:
a b
128hr a1 b1
48hr a2 b2
0hr a5 b1
72hr a10 b2
96hr a12 b1
In [7]: df.reindex(index=order_by_index(df.index, index_natsorted(df.b)))
Out[7]:
a b
0hr a5 b1
128hr a1 b1
96hr a12 b1
72hr a10 b2
48hr a2 b2
If you want to reorder by an arbitrary number of columns (or a column and the index), you can use zip (or itertools.izip on Python 2) to specify sorting on multiple columns. The first column given will be the primary sorting column, then secondary, then tertiary, etc.
In [8]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.a))))
Out[8]:
a b
128hr a1 b1
0hr a5 b1
96hr a12 b1
48hr a2 b2
72hr a10 b2
In [9]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.index))))
Out[9]:
a b
0hr a5 b1
96hr a12 b1
128hr a1 b1
48hr a2 b2
72hr a10 b2
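The primary/secondary behaviour comes from ordinary tuple comparison: each zipped pair is compared element by element. The sketch below illustrates that idea in pure Python. Note that natural_key is a hypothetical, simplified stand-in written for this example; it is not how natsort is implemented and it ignores signs, floats, and locale.

```python
import re

def natural_key(text):
    # Hypothetical simplified key: split into text and digit runs,
    # turning digit runs into ints so 'a2' sorts before 'a10'
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', text)]

col_a = ['a5', 'a1', 'a10', 'a2', 'a12']
col_b = ['b1', 'b1', 'b2', 'b2', 'b1']
labels = ['0hr', '128hr', '72hr', '48hr', '96hr']

# Sort row positions by (b, a): b is the primary key, a breaks ties
order = sorted(range(len(col_a)),
               key=lambda i: (natural_key(col_b[i]), natural_key(col_a[i])))

print(order)
print([labels[i] for i in order])
```

The resulting label order should match the output of the zip(df.b, df.a) call above.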
Here is an alternate method using Categorical objects that I have been told by the pandas devs is the "proper" way to do this. This requires (as far as I can see) pandas >= 0.16.0. Currently, it only works on columns, but apparently in pandas >= 0.17.0 they will add CategoricalIndex which will allow this method to be used on an index.
In [1]: from pandas import DataFrame
In [2]: from natsort import natsorted
In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])
In [4]: df.a = df.a.astype('category')
In [5]: df.a.cat.reorder_categories(natsorted(df.a), inplace=True, ordered=True)
In [6]: df.b = df.b.astype('category')
In [8]: df.b.cat.reorder_categories(natsorted(set(df.b)), inplace=True, ordered=True)
In [9]: df.sort('a')
Out[9]:
a b
128hr a1 b1
48hr a2 b2
0hr a5 b1
72hr a10 b2
96hr a12 b1
In [10]: df.sort('b')
Out[10]:
a b
0hr a5 b1
128hr a1 b1
96hr a12 b1
72hr a10 b2
48hr a2 b2
In [11]: df.sort(['b', 'a'])
Out[11]:
a b
128hr a1 b1
0hr a5 b1
96hr a12 b1
48hr a2 b2
72hr a10 b2
The Categorical object lets you define a sorting order for the DataFrame to use. The elements given when calling reorder_categories must be unique, hence the call to set for column "b".
I leave it to the user to decide if this is better than the reindex method or not, since it requires you to sort the column data independently before sorting within the DataFrame (although I imagine that second sort is rather efficient).
Full disclosure, I am the natsort author.
If you want to sort the df, just sort the index or the data and assign directly to the index of the df, rather than passing the df itself as an argument, which yields an empty list:
In [7]:
df.index = natsorted(a)
df.index
Out[7]:
Index(['0hr', '48hr', '72hr', '96hr', '128hr'], dtype='object')
Note that df.index = natsorted(df.index) also works.
If you pass the df as an argument, it yields an empty list, in this case because the df is empty (has no columns); otherwise it returns the column labels sorted, which is not what you want:
In [10]:
natsorted(df)
Out[10]:
[]
EDIT
If you want to sort the index so that the data is reordered along with the index, then use reindex:
In [13]:
df=pd.DataFrame(index=a, data=np.arange(5))
df
Out[13]:
0
0hr 0
128hr 1
72hr 2
48hr 3
96hr 4
In [14]:
df = df*2
df
Out[14]:
0
0hr 0
128hr 2
72hr 4
48hr 6
96hr 8
In [15]:
df.reindex(index=natsorted(df.index))
Out[15]:
0
0hr 0
48hr 6
72hr 4
96hr 8
128hr 2
Note that you have to assign the result of reindex to either a new df or back to itself; it does not accept the inplace param.
Using sort_values for pandas >= 1.1.0
With the new key argument in DataFrame.sort_values, available since pandas 1.1.0, we can directly sort a column without setting it as an index using natsort.natsort_keygen:
df = pd.DataFrame({
"time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
"value": [10, 20, 30, 40, 50]
})
time value
0 0hr 10
1 128hr 20
2 72hr 30
3 48hr 40
4 96hr 50
from natsort import natsort_keygen
df.sort_values(
by="time",
key=natsort_keygen()
)
time value
0 0hr 10
3 48hr 40
2 72hr 30
4 96hr 50
1 128hr 20