Why is awk so much faster than python in this case?

I have a clip list with 200,000 rows; each row has the form

<field1> <field2>

To get just field 1, I can run a script that looks like this:

import os
import sys
jump = open(sys.argv[1],"r")
clips = open("clips.list","w")
text = jump.readlines()
list_of_clips = str()

for line in text: 
     clip_to_add =   line.split(" ")[0]
     list_of_clips = list_of_clips + clip_to_add +'\n' 

with open ('clips.list', 'w') as file:
    file.write (list_of_clips)

jump.close()

Or I can just use awk '{print $1}'.

Why is awk SO much quicker? It completes the job in about 1 second.

This code is poorly written from a performance point of view. .readlines() has to read the whole file to build a list (a mutable type, a feature you never use), even though in your case you do not need the contents of the whole file to get the processing done. When reading a file you can use for line in <filehandle>: to avoid reading the whole file into memory. Using this, you can print the first field of a space-separated file.txt like so:

with open("file.txt","r") as f:
    for line in f:
        print(line.split(" ")[0])

Moreover, you import os without using anything from it, and you open clips.list twice, first as clips and later as file, and never make any use of the former.
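
Putting these points together, a tidied version of the script might look like the sketch below. The function name extract_first_fields and its parameters are made up for illustration; this is not the asker's code, and no timing against awk is claimed for it.

```python
def extract_first_fields(src_path, dst_path):
    """Write the first space-separated field of each line of src_path to dst_path.

    Hypothetical helper for illustration; returns the number of lines processed.
    """
    count = 0
    # Each file is opened exactly once and closed automatically by the with block.
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:  # the file object yields one line at a time
            # Drop the trailing newline, then split at most once on a space,
            # since only the first field is needed.
            first = line.rstrip("\n").split(" ", 1)[0]
            dst.write(first + "\n")
            count += 1
    return count
```

Calling extract_first_fields(sys.argv[1], "clips.list") after import sys would mirror the original script. Note that split(" ") splits on single spaces only, whereas awk's default field splitting treats any run of blanks as one separator and ignores leading blanks, so the two can differ on lines with leading or repeated spaces.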

To sum up: awk '{print $1}' is correctly written AWK code, whilst the presented Python code is of very dubious quality, so comparing them gives an unreliable result.

