I need to read through lines in multiple files; the first value in each line is the runtime, the third is the job ID, and the fourth is the status. I have created lists to store each of these values, but I don't understand how to connect the lists and sort them so that I get the lines with the 20 fastest runtimes. Does anybody have a suggestion for how I can do that? Thank you!
for filePath in glob.glob(os.path.join(path1, '*.gz')):
    with gzip.open(filePath, 'rt', newline="") as file:
        reader = csv.reader(file)
        for line in file:
            for row in reader:
                runTime = row[0]
                ID = row[2]
                eventType = row[3]
                jobList.append(ID)
                timeList.append(runTime)
                eventList.append(eventType)
jobList = sorted(set(jobList))
counter = len(jobList)
print("There are %s unique jobs." % (counter))
i = 1
while i < 21:
    print("#%s\t%s\t%s\t%s" % (i, timeList[i], jobList[i], eventList[i]))
    i = i + 1
Instead of using three different lists, you can use a single list and append tuples to it, like so:
combinedList.append((runTime, ID, eventType))
You can then sort the combinedList of tuples as shown here: How to sort (list/tuple) of lists/tuples?
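As a minimal sketch of that sorting step, the snippet below sorts a small combinedList by runtime and keeps the fastest entries. The sample rows are made up for illustration, and the conversion with float() assumes the runtime column holds plain numbers.

```python
# Hypothetical sample rows in the form (runTime, ID, eventType).
combinedList = [
    ("12.5", "job-a", "FINISH"),
    ("3.0", "job-b", "FINISH"),
    ("101.2", "job-c", "FAIL"),
]

# csv.reader yields strings, so convert the runtime before comparing;
# as strings, "101.2" would sort before "12.5".
by_runtime = sorted(combinedList, key=lambda rec: float(rec[0]))

# Keep only the fastest entries (20 in the question, 2 in this sketch).
fastest = by_runtime[:2]
for rank, (runTime, ID, eventType) in enumerate(fastest, start=1):
    print("#%s\t%s\t%s\t%s" % (rank, runTime, ID, eventType))
```

Because each tuple keeps the three values of one line together, sorting can never separate a runtime from its job ID and status, which is what goes wrong with three independent lists.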
You can make further improvements, such as using namedtuples in Python. Look them up on SO or Google.
Note: there may be other, more efficient ways to do this. For example, you can use Python's heapq library and keep a heap of size 20 to select the 20 fastest runtimes without sorting everything. You can learn more about it on Python's website or Stack Overflow, but you may need some more algorithmic background.
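One way the heap idea could look is sketched below using heapq.nsmallest, which selects the n smallest items without fully sorting the input. The helper name fastest_jobs and the sample records are hypothetical, and float() again assumes numeric runtimes.

```python
import heapq


def fastest_jobs(records, n=20):
    """Return the n records with the smallest runtime, fastest first.

    `records` is any iterable of (runTime, ID, eventType) tuples.
    """
    return heapq.nsmallest(n, records, key=lambda rec: float(rec[0]))


# Hypothetical sample data, not taken from the question's files.
records = [
    ("12.5", "job-a", "FINISH"),
    ("3.0", "job-b", "FINISH"),
    ("101.2", "job-c", "FAIL"),
    ("7.25", "job-d", "FINISH"),
]

top = fastest_jobs(records, n=2)
for rank, (runTime, ID, eventType) in enumerate(top, start=1):
    print("#%s\t%s\t%s\t%s" % (rank, runTime, ID, eventType))
```

For a few thousand lines a plain sorted() call is usually just as practical; the heap mainly pays off when the input is large and only a small top-n is needed.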
Instead of maintaining three lists jobList, timeList, eventList, you can store (runTime, eventType) tuples in a dictionary, using ID as key, by replacing
jobList = []
timeList = []
eventList = []
…
jobList.append(ID)
timeList.append(runTime)
eventList.append(eventType)
by
jobs = {} # an empty dictionary
…
jobs[ID] = (runTime, eventType)
To loop over that dictionary sorted by increasing runTime values:
# float() because values read by csv.reader are strings (assumes numeric runtimes)
for ID, (runTime, eventType) in sorted(jobs.items(), key=lambda item: float(item[1][0])):
    # do something with it
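A small end-to-end sketch of the dictionary approach follows, with made-up rows standing in for the parsed file lines. Note one consequence of keying by ID: if the same job ID appears on several lines, only the last line seen is kept.

```python
# Hypothetical parsed rows in the form (runTime, ID, eventType).
rows = [
    ("12.5", "job-a", "SUBMIT"),
    ("3.0", "job-b", "FINISH"),
    ("7.25", "job-a", "FINISH"),  # same ID again: replaces the earlier entry
]

jobs = {}  # maps ID -> (runTime, eventType)
for runTime, ID, eventType in rows:
    # Store the runtime as a number so sorting is numeric (an assumption
    # about the data; csv.reader itself returns strings).
    jobs[ID] = (float(runTime), eventType)

# Sort the (ID, (runTime, eventType)) pairs by runtime, fastest first.
ranked = sorted(jobs.items(), key=lambda item: item[1][0])

for rank, (ID, (runTime, eventType)) in enumerate(ranked[:20], start=1):
    print("#%s\t%s\t%s\t%s" % (rank, runTime, ID, eventType))
```

If every line should be ranked rather than one entry per job, a list of tuples is the better fit; the dictionary suits the case where one record per unique job ID is wanted.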
Using the Python sorted built-in would work better for you if you kept runTime, ID, and eventType together in a data structure. I would recommend using a namedtuple, as it allows you to be clear about what you're doing. You can do the following:
from collections import namedtuple
Job = namedtuple("Job", "runtime id event_type")
Then your code could change to be:
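import csv
import glob
import gzip
import os

jobs = []
# path1 is the directory variable from the question
for filePath in glob.glob(os.path.join(path1, '*.gz')):
    with gzip.open(filePath, 'rt', newline="") as file:
        # Iterate over the csv reader only; also looping over `file`
        # would consume lines before the reader sees them.
        for row in csv.reader(file):
            # csv values are strings; float() assumes a numeric runtime
            job = Job(float(row[0]), row[2], row[3])
            jobs.append(job)
jobs = sorted(jobs)
n_jobs = len({job.id for job in jobs})  # count distinct job IDs
print("There are %s unique jobs." % (n_jobs))
for i, job in enumerate(jobs[:20], start=1):
    print("#%s\t%s\t%s\t%s" % (i, job.runtime, job.id, job.event_type))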
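It's worth noting that this sorting works because tuples are compared element by element, starting with the first. If two runtimes tie, the comparison moves on to the next elements of the tuple. Keep in mind that values read by csv.reader are strings, so the runtime has to be converted to a number for the order to be numeric rather than alphabetical.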
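The tie-breaking behavior can be checked with a short sketch. The Job definition matches the one above; the three sample jobs are invented for illustration.

```python
from collections import namedtuple

Job = namedtuple("Job", "runtime id event_type")

# Hypothetical jobs: two share the same runtime.
jobs = [
    Job(5.0, "job-b", "FINISH"),
    Job(5.0, "job-a", "FAIL"),
    Job(1.5, "job-c", "FINISH"),
]

# Tuples compare element by element: runtime first, then id, then event_type.
ordered = sorted(jobs)

for job in ordered:
    print(job.runtime, job.id, job.event_type)
```

Here the two jobs with runtime 5.0 end up ordered by their id field. If ties should keep the original file order instead, passing key=lambda job: job.runtime to sorted() would do that, since Python's sort is stable.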