
Find first and last list within a list with a matching property

Long-time lurker, first-time poster.

I have an extremely large text file (1,184,834 rows) containing flight plan information for a specific day in Europe. Every column represents a new key, and every row is a new segment of a flight. So far I have managed to extract the data I need for my analysis into a list of lists with the following code:

import pprint
import csv
pp = pprint.PrettyPrinter(width=200)

text = open('E:/Downloads/TNFL09/20120506_m1.so6', 'r')

def clean_data(text, rows):
    newlist = []
    reader = csv.reader(text, delimiter=' ')

    # Keep only the columns needed for the analysis.
    for row in reader:
        newlist.append(row[1:6] + row[9:12] + row[16:18])

    return newlist[:rows]


data = clean_data(text, 90)
pp.pprint(data)

The output looks as follows:

['UAFM', 'EGKK', 'A333', '083914', '084141', 'CMB595', '120506', '120506', '156912756', '91'],

['KEWR', 'VABB', 'B772', '005500', '010051', 'UAL48', '120506', '120506', '156912546', '1']

['KEWR', 'VABB', 'B772', '010051', '010310', 'UAL48', '120506', '120506', '156912546', '2']

The interesting items for this problem are the start/end times (#3 & #4), the flight ID (#8) and the sequence number (#9).

Every flight consists of a number of consecutive sequence numbers, so to get the whole flight one must extract all the sequence numbers for that flight ID.

What I want to do is extract the start and end time for every flight. My initial thought was to loop through each list in the list and compare the sequence number to that of the previously iterated list. However, I am a beginner in Python and gave up after a few days of googling.

Thanks,

Peter

One way, assuming your list-of-lists is sorted by sequence number (it looks like it is), is to run it through a generator that aggregates the segments of each flight together:

def aggregate_flights(flights):
    out = []
    last_id = ''
    for row in flights:
        if row[-2] != last_id and len(out) > 0:
            yield (last_id,out)
            out = []
        last_id = row[-2]
        out.append((row[3],row[4])) #2-tuple of (start,end)
    yield (last_id,out)

Which gives for your example input:

list(aggregate_flights(agg))
Out[21]: 
[('156912756', [('083914', '084141')]),
 ('156912546', [('005500', '010051'), ('010051', '010310')])]

A bit messy, but you get the idea. For each flight you'll have a list of 2-tuples of (start,end) which you can further process to get the overall (start,end) for that flight. You could even modify the generator to just give you the overall (start,end), but I tend to like to do my processing in smaller, modular chunks that are easy to debug.
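As a rough sketch of that further processing step: the overall start is the start of the first segment and the overall end is the end of the last one. The helper name overall_times is made up for this illustration, and it assumes each flight's segments arrive in sequence order, as in the output above.

```python
def overall_times(aggregated):
    """Collapse (flight_id, [(start, end), ...]) pairs to (flight_id, start, end)."""
    for flight_id, segments in aggregated:
        # First segment's start, last segment's end.
        yield (flight_id, segments[0][0], segments[-1][1])


# Same shape as the output of aggregate_flights shown above.
aggregated = [
    ('156912756', [('083914', '084141')]),
    ('156912546', [('005500', '010051'), ('010051', '010310')]),
]
print(list(overall_times(aggregated)))
```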

If your inputs are not sorted, you'll need to accumulate your data using a defaultdict. Give it a list factory and append a (start,end) tuple for each line.
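A minimal sketch of the defaultdict approach for unsorted input might look like the following. The function names are hypothetical, and one detail goes beyond the description above: the sequence number is stored alongside each (start,end) pair so the segments can be put back in order before taking the first start and last end. It assumes the row layout from the question (flight ID second to last, sequence number last).

```python
from collections import defaultdict


def collect_segments(flights):
    """Group segments by flight ID, regardless of input order."""
    segments = defaultdict(list)
    for row in flights:
        # int() so that sequence number 10 sorts after 2.
        segments[row[-2]].append((int(row[-1]), row[3], row[4]))
    return segments


def flight_times(flights):
    """Return {flight_id: (start, end)} for every flight."""
    times = {}
    for flight_id, segs in collect_segments(flights).items():
        segs.sort()  # order by sequence number
        times[flight_id] = (segs[0][1], segs[-1][2])
    return times


# The rows from the question, deliberately shuffled.
rows = [
    ['KEWR', 'VABB', 'B772', '010051', '010310', 'UAL48', '120506', '120506', '156912546', '2'],
    ['UAFM', 'EGKK', 'A333', '083914', '084141', 'CMB595', '120506', '120506', '156912756', '91'],
    ['KEWR', 'VABB', 'B772', '005500', '010051', 'UAL48', '120506', '120506', '156912546', '1'],
]
print(flight_times(rows))
```

This holds every segment in memory at once, which is the price of not relying on the input order.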

edit: as requested, here's the modification to only yield a single (start,end) pair per flight:

def aggregate_flights(flights):
    last_id,start,end = None,None,None
    for row in flights:
        if row[-2] != last_id and last_id is not None:
            yield (last_id,(start,end))
            start,end = None,None
        if start is None:
            start = row[3]
        last_id = row[-2]
        end = row[4]
    yield (last_id,(start,end))

At this point I'd note that the output is getting too ugly to abide (an (id,(start,end)) tuple, ugh) so I'd move up to a namedtuple to make things nicer:

from collections import namedtuple
Flight = namedtuple('Flight',['id','start','end'])

So now you have:

def aggregate_flights(flights):
    last_id,start,end = None,None,None
    for row in flights:
        if row[-2] != last_id and last_id is not None:
            yield Flight(last_id,start,end)
            start,end = None,None
        if start is None:
            start = row[3]
        last_id = row[-2]
        end = row[4]
    yield Flight(last_id,start,end)

list(aggregate_flights(agg))
Out[18]: 
[Flight(id='156912756', start='083914', end='084141'),
 Flight(id='156912546', start='005500', end='010310')]

Much nicer.
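To illustrate why the named fields help downstream, here is a small sketch that computes a flight's duration. It assumes the times are HHMMSS strings and that the flight starts and ends on the same day; a flight crossing midnight would also need the date columns. The helper duration_seconds is made up for this example.

```python
from collections import namedtuple
from datetime import datetime

Flight = namedtuple('Flight', ['id', 'start', 'end'])


def duration_seconds(flight):
    # Assumption: start and end are HHMMSS strings on the same day.
    start = datetime.strptime(flight.start, '%H%M%S')
    end = datetime.strptime(flight.end, '%H%M%S')
    return (end - start).total_seconds()


flight = Flight(id='156912546', start='005500', end='010310')
print(duration_seconds(flight))  # 490.0 (8 minutes 10 seconds)
```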

I can't tell if your lists are already sorted by flight ID and sequence number. If they are not, you can sort your list of lists as follows:

# use sort if the original list is not necessary to maintain,
# if it is use sorted and send it to a new variable
# int() makes sequence '10' sort after '2'; plain string comparison would not
flightInfo.sort(key=lambda row: (row[8], int(row[9])))

The above sorts first by flight number and then by sequence number. To extract what you want, you can do:

prev, startTime = None, None
results = []

for i, flight in enumerate(flightInfo):
    if prev is None or prev != flight[8]:
        if prev is not None:
            # use a list if you are going to have to modify these values
            results.append((prev, startTime, flightInfo[i-1][4]))

        startTime = flight[3]
        prev = flight[8]

# the loop never closes the final flight, so add it here
if prev is not None:
    results.append((prev, startTime, flightInfo[-1][4]))

You can use the built-in map function. With "full_list" being the list of flights:

# python.py

def extract(flight):
   """ param flight: list containing one segment's data. """
   # flight[8] is the flight ID; flight[3] and flight[4] are the start and end times.
   return {flight[8]: [flight[3], flight[4]]}

# A list of dictionaries, one per segment: {flight_id: [start, end]}
# list() is needed on Python 3, where map returns a lazy iterator.
result = list(map(extract, full_list))

This gives one dictionary per segment, so the segments of each flight still have to be merged to get that flight's overall start and end.
