What is the best way to sort a sequence in Python?

Question

I am trying to sort the table based on certain conditions that need to happen in a row. Simplified version of a table:

Number  Time
   1    23
   2    45
   3    67
   4    23
   5    11
   6    45
   7    123
   8    34

...

I need to check whether the time was < 40 five times in a row: first rows 1-5, then rows 2-6, and so on. Then I need to print, and save to a file, the first and last time. For example, if the condition is met for rows 2-6, I need to print the time for Number 2 and Number 6. Checking should stop as soon as the condition has been met; there is no need to check the remaining rows. So far I have implemented a counter with two temp variables to check for 3 items in a row, and it works fine. But if I want to check for a condition that holds 30 times in a row, I cannot create 30 temp variables by hand. What is the best way to achieve that? I guess I just need some kind of loop. Thanks!

Here is part of my code:

reader = csv.reader(open(filename))
counter, temp1, temp2, numrow = 0, 0, 0, 0

for row in reader:
    numrow+=1
    if numrow <5:
        col0, col1, col4, col5, col6, col23, col24, col25 = (
            float(row[0]), float(row[1]), float(row[4]), float(row[5]),
            float(row[6]), float(row[23]), float(row[24]), float(row[25]))
        if col1 <= 40:
            # col3 was never assigned above; col0 is assumed to be the intended value
            list1 = (col0, col1, col4, col5, col6, col23, col24, col25)
            counter += 1
            if counter == 3:
                print("Cell# %s" %filename[-10:-5])
                print LAYOUT.format(*headers_short)
                print LAYOUT.format(*temp1)
                print LAYOUT.format(*temp2)
                print LAYOUT.format(*list1)
                print ""

            elif counter == 1:
                temp1=list1

            elif counter == 2:
                temp2=list1

        else:
            counter = 0

I implemented the solution suggested by Bakuriu and it seems to be working. But what would be the best way to combine several tests? I need to check for several conditions, for example:

efficiency less than 40 for 10 cycles in a row,
capacity less than 40 for 5 cycles in a row,
time less than 40 for 25 cycles in a row,
and some others...

Right now I just open a new csv.reader for every test and run the function. I guess it is not the most efficient way, although it works. Sorry, I am just a complete noob.

csvfiles = glob.glob('processed_data/*.stat')
for filename in csvfiles: 

    flag=[]
    flag.append(filename[-12:-5])
    reader = csv.reader(open(filename))
    for a, row_group in enumerate(row_grouper(reader,10)):
        if all(float(row[1]) < 40 for row in row_group):         
            str1= "Efficiency is less than 40 in cycles "+ str(a+1)+'-'+str(a+10)  # a is the index of the first row in the group.
            flag.append(str1)
            break #stop processing other rows.

    reader = csv.reader(open(filename))    
    for b, row_group in enumerate(row_grouper(reader,5)):
        if all(float(row[3]) < 40 for row in row_group):
            str1= "Capacity is less than 40 minutes in cycles "+ str(b+1)+'-'+str(b+5)  # b, not a, is this loop's index
            flag.append(str1)
            break #stop processing other rows.

    reader = csv.reader(open(filename))    
    for b, row_group in enumerate(row_grouper(reader,25)):
        if all(float(row[3]) < 40 for row in row_group):
            str1= "Time is less than 40 in cycles "+ str(b+1)+'-'+str(b+25)  # b, not a, is this loop's index
            flag.append(str1)
            break #stop processing other rows.

    if len(flag)>1:
        for i in flag:
            print i
        print '\n'

Answer 1

You don't have to sort the data at all. A simple solution might be:

def row_grouper(reader, n=5):
    iterrows = iter(reader)
    current = [next(iterrows) for _ in range(n)]  # the first n rows
    for next_row in iterrows:
        yield current
        current.pop(0)
        current.append(next_row)
    yield current  # the last group, which the loop above never yields


reader = csv.reader(open(filename))

for i, row_group in enumerate(row_grouper(reader)):
    if all(float(row[1]) < 40 for row in row_group):
        print i, i+5  #i is the index of the first row in the group.
        break #stop processing other rows.

The row_grouper function is a generator that yields 5-element lists of consecutive rows. Every time it removes the first row of the group and adds the new row at the end.

Instead of a plain list you can use a collections.deque and replace the pop(0) in row_grouper with a popleft() call, which is more efficient, although this doesn't matter much if the list has only 5 elements.

Alternatively you can follow martineau's suggestion: pass the maxlen keyword argument to the deque and avoid popping altogether. This is about twice as fast as using a deque's popleft, which is about twice as fast as using the list's pop(0).
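
A minimal sketch of the maxlen variant, written for Python 3. It is an illustration rather than the exact code martineau proposed, and the sample rows are made up. A deque created with maxlen=n discards its oldest item automatically when a new one is appended, so no explicit pop is needed:

```python
from collections import deque


def row_grouper(rows, n=5):
    """Yield every run of n consecutive rows as a tuple (sliding window)."""
    window = deque(maxlen=n)      # appending to a full deque drops the oldest row
    for row in rows:
        window.append(row)
        if len(window) == n:
            yield tuple(window)   # snapshot, so later appends cannot alter it


# Made-up sample rows in the shape csv.reader would produce (lists of strings).
rows = [["1", "23"], ["2", "45"], ["3", "67"], ["4", "23"], ["5", "11"]]
groups = list(row_grouper(rows, n=3))
print(len(groups))   # 3 windows: rows 1-3, 2-4 and 3-5
```

This version yields nothing at all when the input has fewer than n rows, and it does include the final window.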

Edit: To check more than one condition you can use more than one row_grouper and use itertools.tee to obtain copies of the iterable.

For example:

import itertools as it

def check_condition(group, row_index, limit, found):
    if group is None or found:
        return False
    return all(float(row[row_index]) < limit for row in group)


f_iter, s_iter, t_iter = it.tee(iter(reader), 3)

groups = row_grouper(f_iter, 10), row_grouper(s_iter, 5), row_grouper(t_iter, 25)

found_first = found_second = found_third = False

for index, (first, second, third) in enumerate(it.izip_longest(*groups)):
    if check_condition(first, 1, 40, found_first):
        #stuff
        found_first = True
    if check_condition(second, 3, 40, found_second):
        #stuff
        found_second = True
    if check_condition(third, 3, 40, found_third): 
        # stuff
        found_third = True
    if found_first and found_second and found_third:
        #stop the code if we matched all the conditions once.
        break

The first part simply imports itertools (and assigns an "alias" it to avoid typing itertools every time).

I've defined the check_condition function, since the conditions are getting more complicated and you don't want to repeat them over and over. As you can see, the last line of check_condition is the same as the condition used before: it checks whether the current "row group" satisfies the property. Since we plan to iterate over the file only once, and we cannot stop the loop when only one condition is met (we'd miss the other conditions), we must use a flag per condition that tells us whether that condition (e.g. the one on time) was already met. As you can see in the for loop, we break out of the loop once all the conditions have been met.

Now, the line:

f_iter, s_iter, t_iter = it.tee(iter(reader), 3)

Creates an iterable over the rows of reader and makes 3 copies of it. This means that the loop:

for row in f_iter:
    print(row)

Will print all the rows of the file, just like doing for row in reader. Note however that itertools.tee allows us to obtain copies of the rows without reading the file more than once.
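
As a small illustration of what tee does (Python 3; the list of rows is made up and stands in for a csv.reader), both copies produce the same rows even though the underlying iterator is consumed only once:

```python
import itertools as it

# A one-shot iterator standing in for csv.reader(open(filename)).
rows = iter([["1", "23"], ["2", "45"], ["3", "67"]])
first, second = it.tee(rows, 2)

first_rows = list(first)    # pulls every row from the underlying iterator
second_rows = list(second)  # replayed from the buffer that tee keeps internally

print(first_rows == second_rows)  # True
```

After calling tee, the original iterator should not be used directly any more. Also note that tee has to buffer every row that one copy has seen and another has not; when the copies advance in lockstep, as in the loop in this answer, that buffer stays small.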

Afterwards, we must pass these rows to the row_grouper in order to verify the conditions:

groups = row_grouper(f_iter, 10), row_grouper(s_iter, 5), row_grouper(t_iter, 25)

Finally we have to loop over the "row groups". To do this simultaneously we use itertools.izip_longest (renamed to itertools.zip_longest, without the i, in Python 3). It works just like zip, creating tuples of elements (e.g. zip([1, 2, 3], ["a", "b", "c"]) -> [(1, "a"), (2, "b"), (3, "c")]). The difference is that izip_longest pads the shorter iterables with None. This ensures that we check the conditions on all the possible groups (and that's also why check_condition has to check whether group is None).
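
To make the padding concrete, here is a tiny Python 3 comparison of zip and zip_longest on made-up lists of different lengths:

```python
from itertools import zip_longest  # called izip_longest in Python 2

long_ = [1, 2, 3, 4]
short = ["a", "b"]

padded = list(zip_longest(long_, short))
truncated = list(zip(long_, short))

print(padded)     # [(1, 'a'), (2, 'b'), (3, None), (4, None)]
print(truncated)  # [(1, 'a'), (2, 'b')]
```

The same thing happens with the row groups: a grouper with a window of 25 rows produces fewer groups than one with a window of 5, so its slot is filled with None once it runs out.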

To obtain the current row index we wrap everything in enumerate, just like before. Inside the for loop the code is pretty simple: you check each condition using check_condition and, if the condition is met, you do what you have to do and set the flag for that condition (so that in the following iterations check_condition always returns False for it).

(Note: I must say I did not test the code. I'll test it when I have a bit of time; anyway, I hope I gave you some ideas. Also check out the documentation for itertools.)

Answer 2

You don't really need to sort your data, just keep track of whether the condition you're looking for has occurred in the last N rows of data. Fixed-size collections.deque objects are good for this sort of thing.

import csv
from collections import deque
filename = 'table.csv'
GROUP_SIZE = 5
THRESHOLD = 40
cond_deque = deque(maxlen=GROUP_SIZE)

with open(filename) as datafile:
    reader = csv.reader(datafile) # assume delimiter=','
    reader.next() # skip header row
    for linenum, row in enumerate(reader, start=1):  # process rows of file
        col0, col1, col4, col5, col6, col23, col24, col25 = (
            float(row[i]) for i in (0, 1, 4, 5, 6, 23, 24, 25))
        cond_deque.append(col1 < THRESHOLD)
        if cond_deque.count(True) == GROUP_SIZE:
            print 'lines {}-{} had {} consecutive rows with col1 < {}'.format(
                linenum-GROUP_SIZE+1, linenum, GROUP_SIZE, THRESHOLD)
            break  # found, so stop looking
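
For several conditions in one pass, a different technique from the deque above may be enough: keep one streak counter per condition and reset it whenever a row fails. The sketch below (Python 3) is only an illustration. The function name find_first_runs, the shape of the checks dictionary and the sample data are all assumptions made for the example, not part of the original code:

```python
def find_first_runs(rows, checks):
    """Return {name: (first_row, last_row)} for the first qualifying run of each check.

    checks maps a name to (column index, threshold, required run length).
    Row numbers in the result are 1-based.
    """
    streaks = {name: 0 for name in checks}   # current run length per check
    found = {}
    for rownum, row in enumerate(rows, start=1):
        for name, (col, limit, run) in checks.items():
            if name in found:
                continue                      # this check already succeeded
            if float(row[col]) < limit:
                streaks[name] += 1
            else:
                streaks[name] = 0             # run broken, start over
            if streaks[name] == run:
                found[name] = (rownum - run + 1, rownum)
        if len(found) == len(checks):
            break                             # every check satisfied, stop reading
    return found


# Made-up sample: column 0 is the row number, column 1 the time.
rows = [["1", "23"], ["2", "45"], ["3", "12"], ["4", "23"], ["5", "11"], ["6", "45"]]
checks = {"time": (1, 40, 3), "time_long": (1, 40, 5)}
print(find_first_runs(rows, checks))  # {'time': (3, 5)}
```

Here rows could just as well be a csv.reader, since the function only iterates over it once. The memory used does not depend on the run length, so checking 30 rows in a row costs the same as checking 3.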

2 answers

solution1
2 2013-07-09 20:49:29

solution2
1 ACCEPTED 2013-07-09 22:01:32