Pick values only below a certain threshold

Say I have a number of values (the left column is just the value count: 1, 2, 3, etc.; the right column holds the actual values):

1 5.2
2 1.43
3 3.54
4 887
5 0.35

What I want to do is sort the values in increasing order (smallest at the top), then have Python go through them and keep picking values (to be used later as output) until it comes across a value that is at or above some threshold. For example:

5 0.35
2 1.43
3 3.54
1 5.2
4 887

Say my threshold is 5.0: here I'd like the program to discard 1 and 4 (high values) and give 5, 2, and 3 as output along with their corresponding values. I hope that makes sense. As a trickier extra: if (for whatever reason) my threshold only allows for 2 values, I'd like it to ignore everything and give something like 'No values found'.

The file I'll be pulling them from (the values and counts) looks roughly like this:

  ID  some: value  another: value another: value another: value another: value another: value 1: 5.2

etc., where each of the values mentioned above sits on its own line in the file. So the things I'm interested in would be located at row x, columns 14 and 15 respectively.

The actual line would look like this:

Mod# 2 11494    Chi^2:  1.19608371367   Scale:  0.567691651772  Tin:    1499    Teff:   3400    Luminosity:     568.0   L   M-dot: 4.3497e-08   Tau: 2.44E-01   Dust composition: Fe    IRx1:   0.540471121182

I'm interested in IRx1 and the value following it.

Assuming your file has one number per line:

import itertools

threshold = 5
with open('path/to/file') as infile:
    numbers = [float(line) for line in infile if line.strip()]
# Sort ascending: takewhile stops at the first value at or above the threshold.
numbers.sort()
smaller = list(itertools.takewhile(lambda n: n < threshold, numbers))
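A quick in-memory check of the sort-then-takewhile idea on the question's sample values (no file needed). The names `values` and `below` are illustrative only. Because takewhile stops at the first element that fails the test, the values have to be in ascending order for this to work:

```python
import itertools

threshold = 5.0
values = [5.2, 1.43, 3.54, 887, 0.35]  # sample values from the question

# Ascending order puts the small values first, so takewhile can stop
# at the first value that is at or above the threshold.
below = list(itertools.takewhile(lambda v: v < threshold, sorted(values)))
print(below)  # [0.35, 1.43, 3.54]
```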

If your file looks like this:

1 5.2
2 1.43
3 3.54
4 887
5 0.35

and you want your output to be set([2,3,5]), then:

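import itertools

threshold = 5
with open('path/to/file') as infile:
    # Map each count (first column) to its value (second column).
    pairs = (line.split() for line in infile if line.strip())
    numbers = {int(count): float(value) for count, value in pairs}
# Sort the counts by ascending value, then keep those below the threshold.
lines = sorted(numbers, key=numbers.__getitem__)
answer = list(itertools.takewhile(lambda n: numbers[n] < threshold, lines))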
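Neither snippet above covers the 'No values found' case from the question. Here is a minimal sketch, assuming "only allows for 2 values" means two or fewer values fall below the threshold. The names `pick_below` and `min_count` are hypothetical:

```python
def pick_below(numbers, threshold, min_count=3):
    """Return the keys whose values are below threshold, ordered by ascending value.

    numbers maps each count to its value. If fewer than min_count keys
    qualify, return the string 'No values found' instead.
    """
    picked = sorted((k for k, v in numbers.items() if v < threshold),
                    key=numbers.__getitem__)
    return picked if len(picked) >= min_count else 'No values found'


sample = {1: 5.2, 2: 1.43, 3: 3.54, 4: 887, 5: 0.35}
print(pick_below(sample, 5.0))  # [5, 2, 3]
print(pick_below(sample, 3.0))  # No values found
```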

Given a file that looks like this:

Mod# 2 11494    Chi^2:  1.19608371367   Scale:  0.567691651772  Tin:    1499    Teff:   3400    Luminosity:     568.0   L   M-dot: 4.3497e-08   Tau: 2.44E-01   Dust composition: Fe    IRx1:   0.540471121182

where there is a tab (\t) separating 11494 and Chi^2, the following script should work:

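import itertools
import operator
import re

def takeUntil(fpath, colname, threshold):
    # colname is the label without its trailing colon, e.g. 'IRx1'.
    lines = []
    with open(fpath) as infile:
        for line in infile:
            if not line.strip():
                continue
            # The first tab separates the "Mod# 2 11494" label from the rest.
            first, rest = line.split('\t', 1)
            key, modelId = first.split(None, 1)
            ldict = {key: modelId.strip()}
            # Collect every "name: value" pair; numeric values become floats.
            for name, value in re.findall(r'(\S+):\s+(\S+)', rest):
                try:
                    ldict[name] = float(value)
                except ValueError:
                    ldict[name] = value
            lines.append(ldict)
    # Sort ascending so takewhile stops at the first value at or above the threshold.
    lines.sort(key=operator.itemgetter(colname))
    return [row['Mod#'] for row in itertools.takewhile(lambda row: row[colname] < threshold, lines)]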

With that function, you should be able to specify which column's values are checked against the threshold. This algorithm does use more RAM than strictly necessary, but in exchange you can marshal or pickle lines after reading the file and start from there on subsequent runs. This is especially useful if you have a huge input file that takes a while to process (as I suspect you might).
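The pickling idea could look roughly like the sketch below. It assumes the parsing step has been factored out into its own zero-argument callable, which takeUntil as written does not do. The names `load_or_parse`, `cache_path`, and `parse` are hypothetical:

```python
import os
import pickle


def load_or_parse(cache_path, parse):
    """Return cached rows if cache_path exists; otherwise call parse() and cache the result."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as cache:
            return pickle.load(cache)
    rows = parse()  # the slow step: read and parse the big input file
    with open(cache_path, 'wb') as cache:
        pickle.dump(rows, cache)
    return rows
```

Only unpickle files you wrote yourself, since loading a pickle can execute arbitrary code. The cache also has to be deleted by hand whenever the input file changes.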

The following solution assumes that the data was read in as a list of tuples.

Ex:

[(1,5.2),
(2,1.43),
(3,3.54),
(4,887),
(5,0.35)]

would be the list for the sample data in the problem.

def cutoff(threshold, data):
    sortedData = sorted(data, key=lambda x: x[1])
    # A list comprehension (rather than filter) so len() also works on Python 3.
    finalList = [x for x in sortedData if x[1] < threshold]
    return finalList if len(finalList) > 2 else 'No values found'

The first line of the function sorts the list by the values in the second place of the tuple.

The second line of the function then filters the resulting list so that only the elements whose values are below the threshold remain.

The third line returns the resulting sorted list if it contains more than two elements, and 'No values found' otherwise, which should accomplish what you're trying to do, apart from the file input.
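For the file-input part, here is a minimal sketch, assuming the simple two-column layout from the question (count, then value, separated by whitespace). The name `read_pairs` is hypothetical:

```python
def read_pairs(path):
    """Read 'count value' lines into a list of (int, float) tuples."""
    pairs = []
    with open(path) as infile:
        for line in infile:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            pairs.append((int(fields[0]), float(fields[1])))
    return pairs
```

The returned list has the same shape as the example list above, so it can be passed to cutoff as its data argument.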
