
Sort string columns with numbers in it in Pandas

I want to order my table by a column. The column is a string that has numbers in it, for example ASH11, ASH2, ASH1, etc. The problem is that sort_values sorts character by character, so the values from the example come out in this order: ASH1, ASH11, ASH2. I want them ordered by the last number instead, for example AS20H1, AS20H2, AS20H11.

I thought about taking the last characters of the string, but sometimes the number is only the last character and in other cases the last two. The other way around (taking the characters from the beginning) doesn't work either, because the strings are not always the same length (i.e. in some cases the name is ASH1, ASGH22, ASHGT3, etc.).
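The behaviour described in the question can be reproduced with a small sketch. The DataFrame and the column name "label" are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical sample data; the column name "label" is an assumption.
df = pd.DataFrame({"label": ["ASH11", "ASH2", "ASH1"]})

# The default sort compares strings character by character,
# so "ASH11" lands before "ASH2" because "1" < "2".
lexicographic = df.sort_values(by="label")["label"].tolist()
print(lexicographic)  # ['ASH1', 'ASH11', 'ASH2']
```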

Use the key parameter (new in pandas 1.1.0):

import re
df.sort_values(by=['xxx'], key=lambda col: col.map(lambda x: int(re.split(r'(\d+)', x)[-2])))
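As a minimal runnable sketch of this approach: the sample values and the helper name trailing_number are assumptions, and the helper only works when every string ends in digits (otherwise the [-2] element is not the last number).

```python
import re
import pandas as pd

# Hypothetical sample data; "xxx" stands in for the real column name.
df = pd.DataFrame({"xxx": ["AS20H11", "ASH2", "ASGH22", "AS20H1"]})

def trailing_number(value: str) -> int:
    # re.split with a capturing group keeps the digit runs in the result.
    # For a string ending in digits the last element is '' and the
    # second-to-last element is the final number.
    return int(re.split(r"(\d+)", value)[-2])

# key receives the whole column (a Series) and must return a Series.
result = df.sort_values(by="xxx", key=lambda col: col.map(trailing_number))
print(result["xxx"].tolist())  # ['AS20H1', 'ASH2', 'AS20H11', 'ASGH22']
```

The original column is left unchanged; only the sort order is derived from the extracted number.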

You could extract the integers from your column and then use them to sort your DataFrame:

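  # Extract the trailing number and convert it to a numeric dtype,
  # otherwise the extracted strings would again be sorted character by character
  df["new_index"] = pd.to_numeric(df["yourColumn"].str.extract(r"(\d+)$", expand=False))
  df.sort_values(by=["new_index"], inplace=True)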

If you get some NA values in your "new_index" column, you can use the na_position option of sort_values to choose where to put them (beginning or end).
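A small sketch of the na_position option, assuming the helper column holds the trailing number converted to a numeric dtype. The sample values, including the label "NONUMBER" that has no digits, are made up for illustration.

```python
import pandas as pd

# Hypothetical data; "NONUMBER" has no digits, so its extracted value is missing.
df = pd.DataFrame({"yourColumn": ["ASH11", "NONUMBER", "ASH2", "ASH1"]})

# Extract the trailing digits and convert them to numbers; rows without
# a trailing number become missing values.
df["new_index"] = pd.to_numeric(
    df["yourColumn"].str.extract(r"(\d+)$", expand=False)
)

na_first = df.sort_values(by="new_index", na_position="first")
na_last = df.sort_values(by="new_index", na_position="last")

print(na_first["yourColumn"].tolist())  # ['NONUMBER', 'ASH1', 'ASH2', 'ASH11']
print(na_last["yourColumn"].tolist())   # ['ASH1', 'ASH2', 'ASH11', 'NONUMBER']
```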

Using a list comprehension and a regular expression:

>>> import pandas as pd
>>> import re #Regular expression

>>> a = pd.DataFrame({'label':['AS20H1','AS20H2','AS20H11','ASH1','ASGH22','ASHGT3']})
>>> a
     label
0   AS20H1
1   AS20H2
2  AS20H11
3     ASH1
4   ASGH22
5   ASHGT3

r'(\d+)(?!.*\d)' matches the last number in a string: the negative lookahead (?!.*\d) rejects any run of digits that is followed by another digit later in the string.

>>> a['sort_int'] = [ int(re.search(r'(\d+)(?!.*\d)',i).group(0)) for i in a['label']]
>>> a
     label  sort_int
0   AS20H1         1
1   AS20H2         2
2  AS20H11        11
3     ASH1         1
4   ASGH22        22
5   ASHGT3         3

>>> a.sort_values(by='sort_int',ascending=True)
     label  sort_int
0   AS20H1         1
3     ASH1         1
1   AS20H2         2
5   ASHGT3         3
2  AS20H11        11
4   ASGH22        22
