
Sorting pandas dataframe by multiple columns ignoring case

I have a dataframe in a Python script (using pandas) that needs to be sorted by multiple columns, but the case of the values currently messes up the sorting. For example, a and A are not sorted together: the upper-case letters are sorted first, and then the lower-case ones. Is there an easy way to sort them ignoring case? Currently I have something like this:

df = df.sort_values(['column1', 'column2', 'column3', 'column4', 'column5', 'column6', 'column7'], ascending=[True, True, True, True, True, True, True])

It is important that case is ignored for all of the columns, and the values must not change their case in the final sorted dataframe.

For example, column 1 could be sorted like this (ignoring case):

Aaa
aaB
aaC
Bbb
bBc
bbD
CCc
ccd

Also, it would be great if this worked with any number of columns (no hard-coding).

If you just want to sort by the lower-cased values, you could use something like this:

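import pandas as pd

def sort_naive_lowercase(df, columns, ascending=True):
    # Temporary frame holding lower-cased copies of the sort columns
    df_temp = pd.DataFrame(index = df.index, columns=columns)

    for kol in columns:
        df_temp[kol] = df[kol].str.lower()
    # Sort the temporary frame, then reorder the original rows to match
    # (reindex assumes the index labels of df are unique)
    new_index = df_temp.sort_values(columns, ascending=ascending).index
    return df.reindex(new_index)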
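To check the function against the ordering asked for in the question, here is a small usage sketch. The sample frame and the values in column2 are made up for illustration; the function is repeated so the snippet runs on its own.

```python
import pandas as pd

def sort_naive_lowercase(df, columns, ascending=True):
    # Same function as above, repeated so this snippet is self-contained
    df_temp = pd.DataFrame(index=df.index, columns=columns)
    for kol in columns:
        df_temp[kol] = df[kol].str.lower()
    new_index = df_temp.sort_values(columns, ascending=ascending).index
    return df.reindex(new_index)

# Hypothetical sample data: column1 holds the values from the question, shuffled
df = pd.DataFrame({
    "column1": ["ccd", "Aaa", "bBc", "aaB", "CCc", "Bbb", "aaC", "bbD"],
    "column2": ["x", "Y", "y", "X", "z", "Z", "w", "W"],
})

result = sort_naive_lowercase(df, ["column1", "column2"])
print(result["column1"].tolist())
# ['Aaa', 'aaB', 'aaC', 'Bbb', 'bBc', 'bbD', 'CCc', 'ccd']
```

The values keep their original case because only the temporary frame is lower-cased. On pandas 1.1 or later, sort_values also accepts a key argument that is applied to each sort column, so df.sort_values(columns, key=lambda s: s.str.lower()) should give the same order without a helper function.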

If you expect Unicode problems, you might do something like this (borrowing from @nick-hale's comment):

def sort_by_caseless_columns(df, columns, ascending=True):
    # https://stackoverflow.com/a/29247821/1562285
    import unicodedata

    def normalize_caseless(text):
        # casefold() is a more aggressive lower(); NFKD unifies equivalent code point sequences
        return unicodedata.normalize("NFKD", text.casefold())

    # Temporary frame holding normalized copies of the sort columns
    df_temp = pd.DataFrame(index = df.index, columns=columns)

    for kol in columns:
        df_temp[kol] = df[kol].apply(normalize_caseless)
    new_index = df_temp.sort_values(columns, ascending=ascending).index
    return df.reindex(new_index)
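To see why casefold plus NFKD normalization can matter, here is a small standalone sketch. The example strings are my own and are not from the original answer.

```python
import unicodedata

def normalize_caseless(text):
    # Same helper as in the function above
    return unicodedata.normalize("NFKD", text.casefold())

# German sharp s: lower() keeps the "ß", casefold() expands it to "ss"
print("Straße".lower() == "STRASSE".lower())                          # False
print(normalize_caseless("Straße") == normalize_caseless("STRASSE"))  # True

# The same accented letter written as one code point or as letter + combining accent
composed = "caf\u00e9"
decomposed = "cafe\u0301"
print(composed == decomposed)                                          # False
print(normalize_caseless(composed) == normalize_caseless(decomposed))  # True
```

With plain str.lower() such pairs would compare as different strings and could end up apart in the sorted result.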

If you need to pass more arguments to sort_values, you can use **kwargs.
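One way the **kwargs idea might look is sketched below. The function name sort_naive_lowercase_kwargs and the sample data are hypothetical, and the temporary frame is built with apply instead of a loop, which is my variation rather than the original code.

```python
import pandas as pd

def sort_naive_lowercase_kwargs(df, columns, **kwargs):
    # Lower-cased copy of only the sort columns; df itself is left untouched
    df_temp = df[columns].apply(lambda s: s.str.lower())
    # Forward any extra keyword arguments (e.g. ascending, na_position) to sort_values
    new_index = df_temp.sort_values(columns, **kwargs).index
    return df.reindex(new_index)

# Hypothetical sample data
df = pd.DataFrame({
    "grp": ["X", "x", "y", "Y"],
    "name": ["b", "A", "a", "B"],
})

# grp ascending, name descending, both compared without regard to case
out = sort_naive_lowercase_kwargs(df, ["grp", "name"], ascending=[True, False])
print(out["name"].tolist())
# ['b', 'A', 'B', 'a']
```

Only options that affect the ordering make sense here. Passing ignore_index=True or inplace=True would break the approach, because the function relies on the index of the sorted temporary frame.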

If not all of the columns are strings and some are numerical, you might have to include an additional mask or set to pick out the non-numerical columns, so that only those are lower-cased.
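A minimal sketch of that idea follows, assuming every non-numeric sort column contains strings. The function name sort_caseless_mixed and the sample data are hypothetical. The dtype check uses is_numeric_dtype from pandas.api.types.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def sort_caseless_mixed(df, columns, ascending=True):
    df_temp = pd.DataFrame(index=df.index)
    for kol in columns:
        if is_numeric_dtype(df[kol]):
            # Numeric columns are used for sorting as they are
            df_temp[kol] = df[kol]
        else:
            # Assumption: every non-numeric sort column holds strings
            df_temp[kol] = df[kol].str.lower()
    new_index = df_temp.sort_values(columns, ascending=ascending).index
    return df.reindex(new_index)

# Hypothetical sample data with one string column and one numeric column
df = pd.DataFrame({
    "name": ["b", "A", "a", "B"],
    "score": [2, 3, 1, 1],
})

out = sort_caseless_mixed(df, ["name", "score"])
print(out["name"].tolist())   # ['a', 'A', 'B', 'b']
print(out["score"].tolist())  # [1, 3, 1, 2]
```

Here "a" and "A" compare as equal, so the numeric score column decides their relative order.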

