
How to efficiently sort Python strings containing floating numbers AND a leading + or - sign at multiple and varying locations

Previous solutions focus on strings where numbers are separated from letters by a dash (sorting strings containing numbers and letters) or another consistent delimiter (e.g., '_'; python sort strings with leading numbers alphabetically). The numbers are usually at the same position with respect to the letters. These are relatively easy lists, such as

l=['101-8', '101-8A', '101-9', '102-1', '103-4', '103-4B', '101-10', '101-11','103-10'] 

or

l=['10_file','11_file','1_file','20_file','21_file','2_file']

I need to sort something like:

listfromhell=['a_+10.9.mrc','a_-10.0.mrc','a_-12.0_b.mrc','az_x_y_+60.13_a.hdf','bc_ab_+15.0_rst.mrc']

The sorting needs to be based on the number that follows the - or + sign (including the sign).

Thus, the correct sorting for the list above would be:

listfromhell=['a_-12.0_b.mrc','a_-10.0.mrc','a_+10.9.mrc','bc_ab_+15.0_rst.mrc','az_x_y_+60.13_a.hdf']

A one-liner similar to what has previously been proposed for easier lists works nicely IF the floating-point number used for the sorting (with its preceding + or - sign) always occurs at the same location, where "location" means the index at which the sorting element occurs in the list that results from splitting each string at some consistent delimiter.

For example, a list like this:

nicelist=['a_b_-12.0_d.mrc','a_r_+10.9_t_z_y.mrc','c_a_-10.0.mrc','bc_ab_+15.0_rst.mrc','az_x_+60.13_a.mrc']

would be easily sorted with:

sorted(nicelist, key=lambda s: float(s.split("_")[2].replace('.mrc','')))

because the floating-point number always occurs at index 2 after splitting each string on the consistent delimiter '_'.

How can a similarly simple solution be implemented when the index at which the sorting element occurs (2 in nicelist) is not known a priori?

There are several increasingly complex cases of this question: when the floating-point number occurs at random locations, when there are no consistent delimiters, and when there are confounding '+' and '-' signs elsewhere in the string in addition to the one preceding the floating-point number, as well as confounding digits that are not part of the floating-point number. E.g.,

listfromhellandthensome=['a5-_-12.0b.mrc','a+101.9-.mrc','-a11_-10.0.mrc','b-c_ab_+15.0_rs+t.mrc','a + z_-x_y_+6.10334_a4.mrc']

Basically, the ultimate task would be to find an elegant solution (a one-liner would be amazing) to sort a list of strings in which each element contains a single floating-point number of unknown size/length and sign (it can be either positive or negative) that can occur at any arbitrary position within the string, with no known consistent delimiters.

Thank you for your ideas!

You just need to extract the float/int from each string, along with the sign ( + or - ), and then pass that extracted part into the float() function and sort.

So the regex I came up with ( regex101 ) is:

(\+|-)\d+(\.\d+)?

So we check that the float/int is preceded by a + or a -, then match as many digits as possible up to the decimal point ( . ), and then as many decimal digits as possible after it, but only if there is a decimal point. This last part ("only if there is") is achieved simply with a ?, meaning 0 or 1 occurrences.
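To see what the pattern picks out, here is a small sketch that runs it against some of the strings from the question. The names SIGNED_NUMBER and extract are made up for illustration and are not part of the answer.

```python
import re

# Hypothetical names for illustration; the pattern is the one given above,
# written as a raw string so the backslashes reach the regex engine untouched.
SIGNED_NUMBER = re.compile(r'(\+|-)\d+(\.\d+)?')

def extract(s):
    """Return the first sign-plus-digits run found in s, as text."""
    return SIGNED_NUMBER.search(s).group()

samples = ['a5-_-12.0b.mrc', 'a+101.9-.mrc', 'a + z_-x_y_+6.10334_a4.mrc']
for s in samples:
    print(s, '->', extract(s))
```

Signs that are not directly followed by a digit (such as the '-' in 'a5-_') are skipped. Note that a stray sign that is directly followed by a digit earlier in the string would be picked up instead of the intended number.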

So now to apply this to Python, with your list, l, and having already run import re, you can sort it with this one line:

l.sort(key = lambda s: float(re.search(r'(\+|-)\d+(\.\d+)?', s).group()))

which, for the last example, gives l as:

['a5-_-12.0b.mrc', '-a11_-10.0.mrc', 'a + z_-x_y_+6.10334_a4.mrc', 'b-c_ab_+15.0_rs+t.mrc', 'a+101.9-.mrc']

which I believe to be correct!

And for the listfromhell example, this achieves the expected output of:

['a_-12.0_b.mrc', 'a_-10.0.mrc', 'a_+10.9.mrc', 'bc_ab_+15.0_rst.mrc', 'az_x_y_+60.13_a.hdf']
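The one-liner assumes every string contains a signed number; if one does not, re.search returns None and .group() raises AttributeError. A minimal sketch of a more defensive key function follows. The names _SIGNED and signed_key, and the choice to sort unmatched strings last, are my assumptions rather than part of the answer, and the pattern is an equivalent rewrite using a character class and a non-capturing group.

```python
import re

# Compiled once; equivalent to (\+|-)\d+(\.\d+)? but without capture groups.
_SIGNED = re.compile(r'[+-]\d+(?:\.\d+)?')

def signed_key(s, default=float('inf')):
    """Sort key: the first signed number in s, or `default` if none is found."""
    m = _SIGNED.search(s)
    return float(m.group()) if m else default

names = ['a_+10.9.mrc', 'no_number.mrc', 'a_-12.0_b.mrc']
print(sorted(names, key=signed_key))
```

With the default of infinity, strings without a signed number end up at the end of the sorted list instead of aborting the sort.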

Use a regex to extract the number from each string:

import re

# taken from https://gist.github.com/smac89/bfefc0303c2aab6cac0b08055e195c55
regex = r'.*?([-+](?:\d+\.\d*|\.?\d+)(?:[eE][-+]?\d+)?).*'
compiled = re.compile(regex)

listfromhellandthensome=['a5-_-12.0b.mrc','a+101.9-.mrc','-a11_-10.0.mrc','b-c_ab_+15.0_rs+t.mrc','a + z_-x_y_+6.10334_a4.mrc']

print (sorted(listfromhellandthensome, key=lambda s: float(compiled.sub(r'\1', s))))

Output:

 ['a5-_-12.0b.mrc', '-a11_-10.0.mrc', 'a + z_-x_y_+6.10334_a4.mrc', 'b-c_ab_+15.0_rs+t.mrc', 'a+101.9-.mrc'] 

The above regex matches values such as -.0, -4., +5.0, -3.e3, +5.2E-1, etc. Basically any signed floating-point value in decimal or exponent notation that Python accepts is recognized. This may or may not be what you want, but I'm just making you aware.
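As a quick check of the forms listed above, the following sketch embeds each one in a made-up filename and converts the extracted text with float(). The helper name extract_float and the 'x_..._y' test strings are hypothetical; the pattern is the one from the answer.

```python
import re

compiled = re.compile(r'.*?([-+](?:\d+\.\d*|\.?\d+)(?:[eE][-+]?\d+)?).*')

def extract_float(text):
    """Replace the whole string with the captured number, then convert it."""
    return float(compiled.sub(r'\1', text))

for text in ['x_-.0_y', 'x_-4._y', 'x_+5.0_y', 'x_-3.e3_y', 'x_+5.2E-1_y']:
    print(text, '->', extract_float(text))
```

One thing to keep in mind with the sub-based approach: if a string contains no signed number, sub returns the string unchanged and float() raises ValueError.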
