Sort a Pandas DataFrame by a string column that represents (mostly) numbers?

Question

I have data similar to this:

import pandas as pd

data = [
    dict(name='test1', index='1', status='fail'),
    dict(name='test3', index='3', status='pass'),
    dict(name='test1', index='11', status='pass'),
    dict(name='test1', index='1 2 14 56', status='fail'),
    dict(name='test3', index='20', status='fail'),
    dict(name='test1', index='2', status='fail'),
    dict(name='test3', index='5:1:50', status='pass'),
]

Note that the type of the 'index' column is str. Since it has some irregular entries, I cannot easily convert it to a numeric type. (If that were possible, I would not have this question.)

First I convert it into a DataFrame:

df = pd.DataFrame(data)

This gives me:

    name    index     status
0   test1   1         fail
1   test3   3         pass
2   test1   11        pass
3   test1   1 2 14 56 fail
4   test3   20        fail
5   test1   2         fail
6   test3   5:1:50    pass

Next I sort it:

df1 = df.sort_values(by=['name','index'])

Since the 'index' column is of type str, it is sorted lexically:

    name    index     status
0   test1   1         fail
3   test1   1 2 14 56 fail
2   test1   11        pass
5   test1   2         fail
4   test3   20        fail
1   test3   3         pass
6   test3   5:1:50    pass
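
To illustrate why the lexical order differs from the numeric one, here is a small sketch in plain Python. The list `values` is made up for illustration and simply mirrors the strings from the 'index' column.

```python
values = ['1', '1 2 14 56', '11', '2', '20', '3']

# Strings are compared character by character, so '11' sorts before '2'.
lexical = sorted(values)

# Comparing the leading integer instead gives numeric order.
# sorted() is stable, so entries with equal keys keep their relative order.
numeric = sorted(values, key=lambda s: int(s.split()[0]))

print(lexical)
print(numeric)
```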

What I actually want is this:

    name    index     status
0   test1   1         fail
5   test1   2         fail
2   test1   11        pass
3   test1   1 2 14 56 fail
1   test3   3         pass
4   test3   20        fail
6   test3   5:1:50    pass

The irregular values in rows 4 and 7 (DataFrame indices 3 and 6) could also go to the beginning of each test group. The key point is that the values of the 'index' column that can be converted to a numerical representation shall be sorted numerically, preferably in place. How?

Answer 1

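One possibility is to add helper columns that hold the length of each 'index' string and its first character, and to sort by those. Note that only the first character breaks ties between strings of equal length, so '10' and '12', for example, are not ordered relative to each other.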

df['sort'] = df['index'].str.len()
df['sort2'] = df['index'].str[0]
df1 = df.sort_values(by=['name','sort','sort2'])
df1 = df1.drop(columns = ['sort','sort2'])
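
A possible variation on this idea, sketched under the assumption that the regular entries are non-negative integers without leading zeros: sort by string length first and then by the full string rather than only its first character. The sample frame and the helper column name `_len` are made up for illustration; irregular entries simply land wherever their length puts them.

```python
import pandas as pd

# Hypothetical sample: three plain numbers and one irregular entry.
df = pd.DataFrame({
    'name': ['test1', 'test1', 'test1', 'test1'],
    'index': ['12', '2', '10', '1 2 14 56'],
})

df1 = (
    df
    # For digit-only strings, a shorter string is a smaller number.
    .assign(_len=df['index'].str.len())
    # Within equal lengths, string comparison matches numeric order.
    .sort_values(by=['name', '_len', 'index'])
    .drop(columns=['_len'])
)

print(df1)
```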

Answer 2

This will sort by the name and a temporary column (__ix) that is the first integer found (consecutive digits) in each 'index' string:

Update: You can also use:

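df = (
    df
    .assign(
        # expand=False returns a Series rather than a one-column DataFrame
        __ix=df['index'].str.extract(r'([0-9]+)', expand=False).astype(int)
    )
    .sort_values(['name', '__ix'])
    .drop('__ix', axis=1)  # optional: remove the tmp column
    .reset_index(drop=True)  # optional: renumber the rows; omit to keep the original (scrambled) index
)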

Original:

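import re

df = (
    df
    .assign(
        # group(1) is the captured run of digits; group(0) would also
        # include any non-digit prefix matched by \D*
        __ix=df['index']
        .apply(lambda s: int(re.match(r'\D*(\d+)', s).group(1)))
    )
    .sort_values(['name', '__ix'])
    .drop('__ix', axis=1)
    .reset_index(drop=True)
)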
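
Both variants above go through a temporary column. Since the question asks for an in-place sort, here is a hedged sketch that uses the `key` parameter of `sort_values` (available from pandas 1.1) together with `inplace=True`. The helper name `leading_int` is hypothetical, and the sketch assumes every 'index' string contains at least one digit.

```python
import pandas as pd

df = pd.DataFrame({
    'name':   ['test1', 'test3', 'test1', 'test1', 'test3', 'test1', 'test3'],
    'index':  ['1', '3', '11', '1 2 14 56', '20', '2', '5:1:50'],
    'status': ['fail', 'pass', 'pass', 'fail', 'fail', 'fail', 'pass'],
})

def leading_int(col):
    """Sort key: leave other columns alone, map 'index' to its first run of digits."""
    if col.name != 'index':
        return col
    return col.str.extract(r'(\d+)', expand=False).astype(int)

# key= is applied to each sort column separately.
df.sort_values(['name', 'index'], key=leading_int, inplace=True)

print(df)
```

The original row labels are kept, so `df.reset_index(drop=True, inplace=True)` can follow if a fresh 0..n-1 index is wanted.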

On your data (thanks for providing an easy reproducible example), first check what that __ix column is:

df['index'].apply(lambda s: int(re.match(r'\D*(\d+)', s).group(1)))
# out:
0     1
1     3
2    11
3     1
4    20
5     2
6     5
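
A side note on the regular expression: with `re.match`, `group(0)` is the entire match and `group(1)` is the first parenthesised group. On the sample data these coincide because every string starts with a digit. The string 'id42:7' below is made up to show where they differ.

```python
import re

m = re.match(r'\D*(\d+)', 'id42:7')

whole = m.group(0)   # entire match, including the non-digit prefix
digits = m.group(1)  # only the captured run of digits

print(whole, digits)
```

Passing `whole` to `int()` would raise a ValueError here, whereas `digits` converts cleanly.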

After sorting, your df becomes:

    name      index status
0  test1          1   fail
1  test1  1 2 14 56   fail
2  test1          2   fail
3  test1         11   pass
4  test3          3   pass
5  test3     5:1:50   pass
6  test3         20   fail
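
The output above places the irregular entries according to their leading integer. If they should instead be collected at the end (or beginning) of each name group, as the question allows, one possible sketch is to convert whole strings with `pd.to_numeric(..., errors='coerce')`, which turns anything that is not a plain number into NaN, and then control NaN placement with `na_position`. The column name `_num` is a hypothetical temporary name.

```python
import pandas as pd

df = pd.DataFrame({
    'name':   ['test1', 'test3', 'test1', 'test1', 'test3', 'test1', 'test3'],
    'index':  ['1', '3', '11', '1 2 14 56', '20', '2', '5:1:50'],
    'status': ['fail', 'pass', 'pass', 'fail', 'fail', 'fail', 'pass'],
})

# Irregular entries such as '1 2 14 56' or '5:1:50' become NaN.
num = pd.to_numeric(df['index'], errors='coerce')

df = (
    df
    .assign(_num=num)
    # na_position='first' would put the irregular entries at the start instead.
    .sort_values(['name', '_num'], na_position='last')
    .drop(columns='_num')
)

print(df)
```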

Answer 1
0 2020-12-03 15:00:37

Answer 2
0 accepted 2020-12-03 15:57:05