Sort Pandas DataFrame by string column that represents (mostly) numbers?
I have data similar to this:
import pandas as pd

data = [
    dict(name='test1', index='1', status='fail'),
    dict(name='test3', index='3', status='pass'),
    dict(name='test1', index='11', status='pass'),
    dict(name='test1', index='1 2 14 56', status='fail'),
    dict(name='test3', index='20', status='fail'),
    dict(name='test1', index='2', status='fail'),
    dict(name='test3', index='5:1:50', status='pass'),
]
Note that the 'index' column is of type str. Since it has some irregular entries, I cannot easily convert it to a numeric type. (If that were possible, I would not be asking this question.)
First I convert it into a DataFrame:
df = pd.DataFrame(data)
This gives me:
    name      index status
0  test1          1   fail
1  test3          3   pass
2  test1         11   pass
3  test1  1 2 14 56   fail
4  test3         20   fail
5  test1          2   fail
6  test3     5:1:50   pass
Next I sort it:
df1 = df.sort_values(by=['name','index'])
Since the 'index' column is of type str, it is sorted lexically:
    name      index status
0  test1          1   fail
3  test1  1 2 14 56   fail
2  test1         11   pass
5  test1          2   fail
4  test3         20   fail
1  test3          3   pass
6  test3     5:1:50   pass
What I actually want is this:
    name      index status
0  test1          1   fail
5  test1          2   fail
2  test1         11   pass
3  test1  1 2 14 56   fail
1  test3          3   pass
4  test3         20   fail
6  test3     5:1:50   pass
The irregular values in rows 4 and 7 (DataFrame indices 3 and 6) could also go to the beginning of each test group. The key point is that the values of the 'index' column that can be converted to a number should be sorted numerically, preferably in place. How?
One possibility is to add temporary columns holding the length of each 'index' string and its first character, and sort on those:
df['sort'] = df['index'].str.len()
df['sort2'] = df['index'].str[0]
df1 = df.sort_values(by=['name','sort','sort2'])
df1 = df1.drop(columns = ['sort','sort2'])
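The length trick works because, for non-negative integer strings without leading zeros, a shorter string is always a smaller number, and strings of equal length compare the same lexically as numerically. A first-character tiebreak alone cannot separate values such as '11' and '12', so one variant is to break ties on the full string instead. The following is a minimal sketch under those assumptions (the column name `_len` is only illustrative), not a general natural sort:

```python
import pandas as pd

data = [
    dict(name='test1', index='1', status='fail'),
    dict(name='test3', index='3', status='pass'),
    dict(name='test1', index='11', status='pass'),
    dict(name='test1', index='1 2 14 56', status='fail'),
    dict(name='test3', index='20', status='fail'),
    dict(name='test1', index='2', status='fail'),
    dict(name='test3', index='5:1:50', status='pass'),
]
df = pd.DataFrame(data)

# Temporary sort key: length of each 'index' string (hypothetical column name).
df['_len'] = df['index'].str.len()

# Shorter strings first, then plain string order among strings of equal length.
df1 = df.sort_values(by=['name', '_len', 'index']).drop(columns='_len')
print(df1)
```

On the sample data the irregular entries end up last in each group only because they happen to be the longest strings there; that is a property of this data, not of the method.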
This will sort by the name and a temporary column (__ix) that holds the first integer found (consecutive digits) in each 'index' string:
Update: You can also use:
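df = (
    df
    .assign(
        # first run of digits in each string, as an int;
        # expand=False makes str.extract return a Series
        __ix=df['index'].str.extract(r'([0-9]+)', expand=False).astype(int)
    )
    .sort_values(['name', '__ix'])
    .drop('__ix', axis=1)  # optional: remove the temporary column
    .reset_index(drop=True)  # optional: renumber the index, which is otherwise left scrambled
)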
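Original:
import re

df = (
    df
    .assign(
        # group(1) is the captured run of digits; group(0) would also
        # include any leading non-digit characters and break int()
        __ix=df['index']
        .apply(lambda s: int(re.match(r'\D*(\d+)', s).group(1)))
    )
    .sort_values(['name', '__ix'])
    .drop('__ix', axis=1)
    .reset_index(drop=True)
)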
On your data (thanks for providing an easily reproducible example), first check what that __ix column is:
df['index'].apply(lambda s: int(re.match(r'\D*(\d+)', s).group(1)))
# out:
0     1
1     3
2    11
3     1
4    20
5     2
6     5
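Both snippets assume that every 'index' string contains at least one digit; otherwise `re.match` returns None, and `str.extract` yields NaN, which cannot be cast to int. A small sketch of a more defensive extractor follows. The helper name `first_int` and the fallback value of -1 are assumptions made for illustration, not part of the answer:

```python
import re


def first_int(s, default=-1):
    """Return the first run of digits in s as an int, or default if there is none."""
    m = re.search(r'\d+', s)
    return int(m.group(0)) if m else default


# Try it on the kinds of strings found in the 'index' column,
# plus one made-up entry without any digits.
for s in ['11', '1 2 14 56', '5:1:50', 'n/a']:
    print(repr(s), '->', first_int(s))
```

It could be used as `df['index'].apply(first_int)` in place of the lambda; with the default of -1, entries without digits would sort to the start of their group.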
After sorting, your df becomes:
    name      index status
0  test1          1   fail
1  test1  1 2 14 56   fail
2  test1          2   fail
3  test1         11   pass
4  test3          3   pass
5  test3     5:1:50   pass
6  test3         20   fail
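In this result the irregular entries are ordered by their first number, whereas the question asks for them at the end (or beginning) of each group, ideally in place. One possible sketch, assuming that "irregular" means anything `pd.to_numeric` cannot parse as a single number, uses a temporary column (the name `__num` is illustrative):

```python
import pandas as pd

data = [
    dict(name='test1', index='1', status='fail'),
    dict(name='test3', index='3', status='pass'),
    dict(name='test1', index='11', status='pass'),
    dict(name='test1', index='1 2 14 56', status='fail'),
    dict(name='test3', index='20', status='fail'),
    dict(name='test1', index='2', status='fail'),
    dict(name='test3', index='5:1:50', status='pass'),
]
df = pd.DataFrame(data)

# Purely numeric strings become numbers; everything else becomes NaN.
df['__num'] = pd.to_numeric(df['index'], errors='coerce')

# Sort in place; NaN (the irregular entries) goes last within each name group.
df.sort_values(by=['name', '__num'], na_position='last', inplace=True)

# Remove the temporary column again, also in place.
df.drop(columns='__num', inplace=True)
print(df)
```

Passing `na_position='first'` instead would move the irregular rows to the start of each group.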