Sort a pandas DataFrame by multiple columns while ignoring case

Question

I have a dataframe in a Python script (using pandas) that needs to be sorted by multiple columns, but the case of the values currently messes up the sorting. For example, a and A are not sorted as equal: the upper-case letters are sorted first, and then the lower-case ones. Is there an easy way to sort them ignoring case? Currently I have something like this:

df = df.sort_values(['column1', 'column2', 'column3', 'column4', 'column5', 'column6', 'column7'], ascending=[True, True, True, True, True, True, True])

It is important that case is ignored for all of the columns, and the values mustn't change their case in the final sorted dataframe.

For example, column 1 could be sorted like this (ignoring case):

Aaa
aaB
aaC
Bbb
bBc
bbD
CCc
ccd

Also, it would be awesome if the functionality worked with any number of columns (no hard-coding).

Answer 1

If you just want to sort by the lower-cased values, you could use something like this:

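import pandas as pd

def sort_naive_lowercase(df, columns, ascending=True):
    # Build a temporary frame holding lower-cased copies of the sort columns
    df_temp = pd.DataFrame(index=df.index, columns=columns)

    for kol in columns:
        df_temp[kol] = df[kol].str.lower()
    # Sort the lower-cased copy, then reorder the original rows to match,
    # so the values in df keep their original case
    new_index = df_temp.sort_values(columns, ascending=ascending).index
    return df.reindex(new_index)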
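As a side note, on pandas 1.1 or later sort_values also accepts a key argument that is applied to each sort column separately, so the same lower-case comparison can be sketched without a temporary frame. The sample data below is hypothetical: column1 holds the values from the question in shuffled order, and column2 is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: column1 is the question's example, shuffled
df = pd.DataFrame({
    "column1": ["ccd", "bBc", "Aaa", "CCc", "aaC", "Bbb", "bbD", "aaB"],
    "column2": ["x", "Y", "z", "W", "v", "U", "t", "S"],
})

# Any number of columns can be listed here, nothing is hard-coded
columns = ["column1", "column2"]

# key is called once per column; it only affects the comparison,
# so the values in the result keep their original case
result = df.sort_values(by=columns, key=lambda col: col.str.lower())

print(result["column1"].tolist())
# ['Aaa', 'aaB', 'aaC', 'Bbb', 'bBc', 'bbD', 'CCc', 'ccd']
```

This assumes every listed column holds strings; the .str accessor raises an error on numerical columns.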

If you expect unicode problems, you might do something like this (borrowing from @nick-hale's comment):

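import unicodedata

import pandas as pd

def sort_by_caseless_columns(df, columns, ascending=True):
    # https://stackoverflow.com/a/29247821/1562285

    def normalize_caseless(text):
        # casefold() is a more aggressive lower(); NFKD then decomposes
        # characters so that equivalent Unicode forms compare alike
        return unicodedata.normalize("NFKD", text.casefold())

    # Temporary frame holding the normalized sort keys
    df_temp = pd.DataFrame(index=df.index, columns=columns)

    for kol in columns:
        df_temp[kol] = df[kol].apply(normalize_caseless)
    # Sort the keys, then reorder the original rows to match
    new_index = df_temp.sort_values(columns, ascending=ascending).index
    return df.reindex(new_index)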
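To illustrate why casefold plus NFKD normalization can matter, here is a small sketch on plain Python strings that compares str.lower with the normalize_caseless helper from the function above. The sample words are made up for illustration.

```python
import unicodedata


def normalize_caseless(text):
    # Same helper as in the answer: casefold, then NFKD-normalize
    return unicodedata.normalize("NFKD", text.casefold())


# lower() keeps the sharp s, casefold() expands it to "ss"
strasse_lower = "Stra\u00dfe".lower()                # "straße"
strasse_caseless = normalize_caseless("Stra\u00dfe")  # "strasse"

# Hypothetical sample words; "\u00e9clair" is "éclair" with a precomposed é
words = ["Zebra", "\u00e9clair", "Eclair", "apple"]

# With lower() alone, the precomposed é (U+00E9) sorts after "z"
by_lower = sorted(words, key=str.lower)

# After NFKD, é becomes "e" plus a combining accent, so it sorts next to "Eclair"
by_caseless = sorted(words, key=normalize_caseless)

print(by_lower)
print(by_caseless)
```

Note that this is still a code-point comparison, not locale-aware collation.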

If you need to pass further arguments to sort_values, you can use **kwargs.
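A minimal sketch of the **kwargs idea, assuming the lower-case approach from above. The function name sort_lowercase_with_kwargs and the sample data are hypothetical.

```python
import pandas as pd


def sort_lowercase_with_kwargs(df, columns, **kwargs):
    # Hypothetical variant: lower-cased sort keys, one per requested column
    df_temp = pd.DataFrame(
        {kol: df[kol].str.lower() for kol in columns}, index=df.index
    )
    # Everything else (ascending, na_position, kind, ...) goes to sort_values.
    # Assumption: callers do not pass inplace=True or ignore_index=True,
    # because the sorted index is needed to reorder the original rows.
    new_index = df_temp.sort_values(columns, **kwargs).index
    return df.reindex(new_index)


# Hypothetical sample data
df = pd.DataFrame({
    "column1": ["b", "A", "a", "B"],
    "column2": ["x", "Y", "Z", "w"],
})

# column1 ascending, column2 descending, both compared ignoring case
result = sort_lowercase_with_kwargs(
    df, ["column1", "column2"], ascending=[True, False]
)
print(result)
```

Like the functions above, this relies on df having a unique index, since reindex is used to reorder the rows.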

If some of the columns are numerical rather than strings, you might have to add a mask or a set identifying the non-numerical columns, so that only those are lower-cased.
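One possible way to handle mixed columns, sketched under the assumption that every non-numerical column holds strings: check each column's dtype with pandas.api.types.is_numeric_dtype and lower-case only the non-numerical ones. The function name sort_lowercase_mixed and the sample data are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def sort_lowercase_mixed(df, columns, ascending=True):
    keys = {}
    for kol in columns:
        if is_numeric_dtype(df[kol]):
            # Numerical columns are used as sort keys unchanged
            keys[kol] = df[kol]
        else:
            # Assumption: every non-numerical column holds strings
            keys[kol] = df[kol].str.lower()
    df_temp = pd.DataFrame(keys, index=df.index)
    new_index = df_temp.sort_values(columns, ascending=ascending).index
    return df.reindex(new_index)


# Hypothetical sample data with one string and one integer column
df = pd.DataFrame({
    "name": ["b", "A", "a", "B"],
    "count": [1, 2, 1, 2],
})

result = sort_lowercase_mixed(df, ["name", "count"])
print(result)
```

Other non-string dtypes, such as datetimes, would need their own branch, because the .str accessor only works on string data.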
