What is the best way to sort a sequence in Python?

I am trying to sort a table based on certain conditions that need to hold for several rows in a row. A simplified version of the table:

Number  Time
   1    23
   2    45
   3    67
   4    23
   5    11
   6    45
   7    123
   8    34

... ...

I need to check whether the time was below 40 five times in a row. That is, I need to check rows 1-5, then rows 2-6, and so on, and then print and save to a file the first and last time. For example, if the condition is met for rows 2-6, I need to print the time for Number 2 and Number 6. Checking should stop once the condition has been met; there is no need to check the remaining rows. So far I have implemented a counter with two temp variables that checks for 3 items in a row, and it works fine. But if I want to check for a condition that holds 30 times in a row, I cannot create 30 temp variables by hand. What is the best way to achieve that? I guess I just need some kind of loop. Thanks!

Here is part of my code:

reader = csv.reader(open(filename))
counter, temp1, temp2, numrow = 0, 0, 0, 0

for row in reader:
    numrow+=1
    if numrow <5:
        col0, col1, col4, col5, col6, col23, col24, col25 = (
            float(row[0]), float(row[1]), float(row[4]), float(row[5]),
            float(row[6]), float(row[23]), float(row[24]), float(row[25]))
        if col1 <= 40:
            # col3 was never assigned above; col0 is assumed to be what was meant
            list1 = (col0, col1, col4, col5, col6, col23, col24, col25)
            counter += 1
            if counter == 3:
                print("Cell# %s" %filename[-10:-5])
                print LAYOUT.format(*headers_short)
                print LAYOUT.format(*temp1)
                print LAYOUT.format(*temp2)
                print LAYOUT.format(*list1)
                print ""

            elif counter == 1:
                temp1=list1

            elif counter == 2:
                temp2=list1

        else:
            counter = 0

I implemented the solution suggested by Bakuriu and it seems to be working. But what is the best way to combine several tests? I need to check for several conditions, say:

  • efficiency below 40 for 10 cycles in a row,
  • capacity below 40 for 5 cycles in a row,
  • time below 40 for 25 cycles in a row,
  • and some others...

Right now I just open a new csv.reader for every test and run the function. I guess it is not the most efficient way, although it works. Sorry, I am just a complete noob.

import csv
import glob

csvfiles = glob.glob('processed_data/*.stat')
for filename in csvfiles:

    flag = []
    flag.append(filename[-12:-5])
    reader = csv.reader(open(filename))
    for a, row_group in enumerate(row_grouper(reader, 10)):
        if all(float(row[1]) < 40 for row in row_group):
            # a is the index of the first row in the group
            str1 = "Efficiency is less than 40 in cycles " + str(a+1) + '-' + str(a+10)
            flag.append(str1)
            break  # stop processing other rows

    reader = csv.reader(open(filename))
    for b, row_group in enumerate(row_grouper(reader, 5)):
        if all(float(row[3]) < 40 for row in row_group):
            str1 = "Capacity is less than 40 minutes in cycles " + str(b+1) + '-' + str(b+5)
            flag.append(str1)
            break  # stop processing other rows

    reader = csv.reader(open(filename))
    for c, row_group in enumerate(row_grouper(reader, 25)):
        if all(float(row[3]) < 40 for row in row_group):
            str1 = "Time is less than 40 in cycles " + str(c+1) + '-' + str(c+25)
            flag.append(str1)
            break  # stop processing other rows

    if len(flag) > 1:
        for i in flag:
            print(i)
        print('\n')

You don't have to sort the data at all. A simple solution might be:

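import csv
from itertools import islice

def row_grouper(reader, size=5):
    iterrows = iter(reader)
    current = list(islice(iterrows, size))
    if len(current) < size:
        return  # fewer than `size` rows: there is no complete group
    yield current
    for next_row in iterrows:
        current.pop(0)
        current.append(next_row)
        yield current  # yield after appending so the last group is not skipped


reader = csv.reader(open(filename))

for i, row_group in enumerate(row_grouper(reader)):
    if all(float(row[1]) < 40 for row in row_group):
        print("%d %d" % (i, i + 5))  # i is the index of the first row in the group
        break  # stop processing other rows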

The row_grouper function is a generator that yields 5-element lists of consecutive rows. Between groups it removes the first row of the group and adds the next row at the end.
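To make the sliding-window idea concrete, here is a minimal Python 3 sketch run against the simplified table from the question. The names sliding_windows and first_run_below are made up for this illustration, and each window is copied before it is yielded so that callers can keep it around safely.

```python
from itertools import islice

def sliding_windows(rows, size):
    """Yield each run of `size` consecutive rows as a new list."""
    iterator = iter(rows)
    window = list(islice(iterator, size))
    if len(window) < size:
        return  # not enough rows for even one window
    yield list(window)
    for row in iterator:
        window.pop(0)       # drop the oldest row
        window.append(row)  # add the newest row
        yield list(window)

# (Number, Time) rows copied from the simplified table in the question.
table = [(1, 23), (2, 45), (3, 67), (4, 23),
         (5, 11), (6, 45), (7, 123), (8, 34)]

def first_run_below(rows, size, limit):
    """Return (first_row, last_row) of the first window whose times are all < limit."""
    for window in sliding_windows(rows, size):
        if all(time < limit for _, time in window):
            return window[0], window[-1]
    return None

print(first_run_below(table, 2, 40))  # ((4, 23), (5, 11))
print(first_run_below(table, 5, 40))  # None
```

With the sample data there is no run of five rows below 40, which is why the second call returns None; a run of two is first found at rows 4-5.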


Instead of a plain list you can use a deque and replace the pop(0) in row_grouper with a popleft() call, which is more efficient, although this doesn't matter much if the list has only 5 elements.
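A sketch of that deque variant might look like the following (Python 3; the name row_grouper_deque is hypothetical, and each group is copied to a list before being yielded):

```python
from collections import deque
from itertools import islice

def row_grouper_deque(rows, size=5):
    """Sliding window over `rows`, using deque.popleft() instead of list.pop(0)."""
    iterator = iter(rows)
    window = deque(islice(iterator, size))
    if len(window) < size:
        return  # fewer than `size` rows: no complete group
    yield list(window)
    for row in iterator:
        window.popleft()    # O(1), unlike list.pop(0) which shifts every element
        window.append(row)
        yield list(window)

groups = list(row_grouper_deque(range(5), 3))
print(groups)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```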

Alternatively you can follow martineau's suggestion: create the deque with the maxlen keyword argument and avoid popping altogether. This is about twice as fast as using a deque's popleft, which is about twice as fast as using the list's pop(0).
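With maxlen the deque discards its oldest element automatically on append, so the generator shrinks to a few lines. This is only a sketch (Python 3, hypothetical name row_grouper_maxlen), not the code that was timed above:

```python
from collections import deque

def row_grouper_maxlen(rows, size=5):
    """Sliding window that relies on deque(maxlen=...) to evict the oldest row."""
    window = deque(maxlen=size)
    for row in rows:
        window.append(row)  # silently drops the leftmost row once full
        if len(window) == size:
            yield list(window)

groups = list(row_grouper_maxlen(range(5), 3))
print(groups)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

A side effect of this shape is that inputs shorter than size need no special handling: the loop simply never yields.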


Edit: To check more than one condition you can use more than one row_grouper, and use itertools.tee to obtain copies of the iterable.

For example:

import itertools as it

def check_condition(group, row_index, limit, found):
    if group is None or found:
        return False
    return all(float(row[row_index]) < limit for row in group)


f_iter, s_iter, t_iter = it.tee(iter(reader), 3)

groups = row_grouper(f_iter, 10), row_grouper(s_iter, 5), row_grouper(t_iter, 25)

found_first = found_second = found_third = False

for index, (first, second, third) in enumerate(it.izip_longest(*groups)):
    if check_condition(first, 1, 40, found_first):
        #stuff
        found_first = True
    if check_condition(second, 3, 40, found_second):
        #stuff
        found_second = True
    if check_condition(third, 3, 40, found_third): 
        # stuff
        found_third = True
    if found_first and found_second and found_third:
        #stop the code if we matched all the conditions once.
        break

The first part simply imports itertools (under the alias it, to avoid typing itertools every time).

I've defined the check_condition function, since the conditions are getting more complicated and you don't want to repeat them over and over. As you can see, the last line of check_condition is the same as the condition before: it checks whether the current "row group" satisfies the property. Since we plan to iterate over the file only once, and we cannot stop the loop when only one condition is met (we'd miss the other conditions), we must use a flag that tells us whether the condition on (e.g.) time has already been met. As you can see in the for loop, we break out of the loop when all the conditions have been met.

Now, the line:

f_iter, s_iter, t_iter = it.tee(iter(reader), 3)

creates an iterator over the rows of reader and makes 3 independent copies of it. This means that the loop:

for row in f_iter:
    print(row)

will print all the rows of the file, just like for row in reader would. Note, however, that itertools.tee lets us obtain copies of the rows without reading the file more than once.
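A small Python 3 illustration of this behaviour, using a list iterator as a stand-in for csv.reader (the sample rows are invented). tee buffers the items internally, so every copy sees all the rows even though the source is consumed only once:

```python
import itertools

# Stand-in for csv.reader: an iterator that can only be consumed once.
rows = iter([["1", "23"], ["2", "45"], ["3", "67"]])

first, second = itertools.tee(rows, 2)

first_rows = list(first)    # pulls every row from the source
second_rows = list(second)  # served from tee's internal buffer
leftover = list(rows)       # the source itself is now exhausted

print(first_rows)   # [['1', '23'], ['2', '45'], ['3', '67']]
print(second_rows)  # [['1', '23'], ['2', '45'], ['3', '67']]
print(leftover)     # []
```

Because of that buffering, rows stay in memory until the slowest copy has read them, and the original iterator should not be used directly once it has been passed to tee.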

Afterwards, we must pass these copies to row_grouper in order to check the conditions:

groups = row_grouper(f_iter, 10), row_grouper(s_iter, 5), row_grouper(t_iter, 25)

Finally we have to loop over the "row groups". To do this simultaneously we use itertools.izip_longest (renamed to itertools.zip_longest, without the i, in Python 3). It works just like zip, creating tuples of corresponding elements (e.g. zip([1, 2, 3], ["a", "b", "c"]) -> [(1, "a"), (2, "b"), (3, "c")]). The difference is that izip_longest pads the shorter iterables with Nones instead of stopping at the shortest one. This ensures that we check the conditions on all the possible groups (and that's also why check_condition has to check whether group is None).
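The padding is easy to see on two sequences of different lengths. The snippet below uses the Python 3 name zip_longest; the sample values are arbitrary:

```python
from itertools import zip_longest

numbers = [1, 2, 3, 4]
letters = ["a", "b"]

truncated = list(zip(numbers, letters))        # stops at the shorter input
padded = list(zip_longest(numbers, letters))   # pads the shorter input with None

print(truncated)  # [(1, 'a'), (2, 'b')]
print(padded)     # [(1, 'a'), (2, 'b'), (3, None), (4, None)]
```

In the row-group setting, the grouper with the largest window yields the fewest groups, so it is the one that gets padded with None near the end of the file.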

To obtain the current row index we wrap everything in enumerate, just like before. The code inside the for loop is pretty simple: you check each condition using check_condition and, if it is met, you do what you have to do and set the flag for that condition (so that check_condition returns False for it in all following iterations).

(Note: I must say I did not test the code. I'll test it when I have a bit of time; anyway, I hope I gave you some ideas. And check out the documentation for itertools.)
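For reference, the same single-pass idea can be written as one self-contained Python 3 sketch. This is not the code above: the function find_runs, the (label, column, size, limit) layout of checks, and the in-memory sample data are all assumptions made for the illustration, and a dictionary replaces the separate found_* flags.

```python
import csv
import io
from collections import deque
from itertools import tee, zip_longest

def row_grouper(rows, size):
    """Yield every window of `size` consecutive rows."""
    window = deque(maxlen=size)
    for row in rows:
        window.append(row)
        if len(window) == size:
            yield list(window)

def find_runs(rows, checks):
    """Scan `rows` once; `checks` holds (label, column, size, limit) tuples.

    Returns {label: (first_cycle, last_cycle)}, 1-based, for the first run
    in which `column` stays below `limit` for `size` consecutive rows.
    """
    copies = tee(rows, len(checks))
    groups = [row_grouper(copy, size)
              for copy, (_, _, size, _) in zip(copies, checks)]
    found = {}
    for index, windows in enumerate(zip_longest(*groups)):
        for (label, column, size, limit), window in zip(checks, windows):
            if label in found or window is None:
                continue  # already matched, or this grouper ran out of rows
            if all(float(row[column]) < limit for row in window):
                found[label] = (index + 1, index + size)
        if len(found) == len(checks):
            break  # every condition has been matched once
    return found

# Invented two-column data standing in for one of the .stat files.
sample = io.StringIO("10,50\n20,30\n30,20\n50,10\n60,10\n")
checks = [("col0", 0, 3, 40), ("col1", 1, 2, 25)]
result = find_runs(csv.reader(sample), checks)
print(result)  # {'col0': (1, 3), 'col1': (3, 4)}
```

Adding another condition then only means appending a tuple to checks, rather than adding a new flag variable and a new if block.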

You don't really need to sort your data, just keep track of whether the condition you're looking for has held for each of the last N rows of data. Fixed-size collections.deques are good for this sort of thing.

import csv
from collections import deque
filename = 'table.csv'
GROUP_SIZE = 5
THRESHOLD = 40
cond_deque = deque(maxlen=GROUP_SIZE)

with open(filename) as datafile:
    reader = csv.reader(datafile) # assume delimiter=','
    next(reader)  # skip header row (next() works in Python 2.6+ and 3)
    for linenum, row in enumerate(reader, start=1):  # process rows of file
        col0, col1, col4, col5, col6, col23, col24, col25 = (
            float(row[i]) for i in (0, 1, 4, 5, 6, 23, 24, 25))
        cond_deque.append(col1 < THRESHOLD)
        if cond_deque.count(True) == GROUP_SIZE:
            print('lines {}-{} had {} consecutive rows with col1 < {}'.format(
                linenum-GROUP_SIZE+1, linenum, GROUP_SIZE, THRESHOLD))
            break  # found, so stop looking
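The question also asks for the first and last rows of the run to be printed and saved. One possible variation, sketched here in Python 3, stores the rows themselves and clears the deque whenever a row fails the test, so a full deque always means a complete run. The function name first_run and the sample data are invented, and an io.StringIO object stands in for the real output file.

```python
import csv
import io
from collections import deque

def first_run(rows, column, group_size, threshold):
    """Return (first_row, last_row) of the first `group_size` consecutive rows
    whose `column` value is below `threshold`, or None if there is no such run."""
    recent = deque(maxlen=group_size)
    for row in rows:
        if float(row[column]) < threshold:
            recent.append(row)
            if len(recent) == group_size:
                return recent[0], recent[-1]
        else:
            recent.clear()  # the streak is broken, start over
    return None

# Invented sample data in the shape of the question's table.
data = io.StringIO("Number,Time\n1,23\n2,45\n3,30\n4,23\n5,11\n6,45\n")
reader = csv.reader(data)
next(reader)  # skip header row

result = first_run(reader, column=1, group_size=3, threshold=40)
print(result)  # (['3', '30'], ['5', '11'])

# Save the two rows; a real script would pass an open file instead.
output = io.StringIO()
if result is not None:
    csv.writer(output).writerows(result)
```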
