I have a collection of large (~100,000,000 line) text files in the format:
0.088293 1.3218e-32 2.886e-07 2.378e-02 21617 28702
0.111662 1.1543e-32 3.649e-07 1.942e-02 93804 95906
0.137970 1.2489e-32 4.509e-07 1.917e-02 89732 99938
0.149389 8.0725e-32 4.882e-07 2.039e-02 71615 69733
...
And I'd like to find the mean and sum of column 2, the maximum and minimum values of columns 3 and 4, and the total number of lines. How can I do this efficiently using NumPy? Because of the files' size, loadtxt and genfromtxt are no good (they take a long time to execute) since they attempt to read the whole file into memory. In contrast, Unix tools like awk:
awk '{ total += $2 } END { print total/NR }' <filename>
work in a reasonable amount of time. Can Python/NumPy do the job of awk for such big files?
You can say something like:
awk '{ total2 += $2
for (i=2;i<=3;i++) {
max[i]=(length(max[i]) && max[i]>$i)?max[i]:$i
min[i]=(length(min[i]) && min[i]<$i)?min[i]:$i
}
} END {
print "items", "average2", "min2", "min3", "max2", "max3"
print NR, total2/NR, min[2], min[3], max[2], max[3]
}' file
With your given input:
$ awk '{total2 += $2; for (i=2;i<=3;i++) {max[i]=(length(max[i]) && max[i]>$i)?max[i]:$i; min[i]=((length(min[i]) && min[i]<$i)?min[i]:$i)}} END {print "items", "average2", "min2", "min3", "max2", "max3"; print NR, total2/NR, min[2], min[3], max[2], max[3]}' a | column -t
items average2 min2 min3 max2 max3
4 2.94938e-32 1.1543e-32 2.886e-07 8.0725e-32 4.882e-07
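For comparison, the same single-pass idea can be sketched in plain Python. This is only an illustration, not part of the answer above: the function name stream_stats and the dictionary keys are made up here, and it tracks columns 3 and 4 (as the question asks) rather than columns 2 and 3. It assumes every non-blank line has at least four whitespace-separated numeric fields.

```python
import math


def stream_stats(lines):
    """Accumulate running statistics in one pass over an iterable of lines.

    Only a handful of scalars are kept, so memory use does not grow
    with the number of lines.
    """
    count = 0
    total2 = 0.0
    min3 = min4 = math.inf
    max3 = max4 = -math.inf

    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        count += 1
        total2 += float(fields[1])  # column 2 (fields are 0-indexed)
        c3 = float(fields[2])       # column 3
        c4 = float(fields[3])       # column 4
        min3, max3 = min(min3, c3), max(max3, c3)
        min4, max4 = min(min4, c4), max(max4, c4)

    return {
        "count": count,
        "sum2": total2,
        "mean2": total2 / count if count else math.nan,
        "min3": min3, "max3": max3,
        "min4": min4, "max4": max4,
    }
```

A file object is itself an iterable of lines, so this could be called as stream_stats(f) inside a with open("data.txt") as f: block, where data.txt stands in for one of the large files.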
Loop through the lines and apply a regex to extract the data you are looking for, appending each value to an initially empty list for each column you want.
Once you have a column in list form, you can apply max(list), min(list) and sum(list)/len(list) to get whatever calculations you are interested in.
Note: the extracted values are strings, so convert them with float() before appending them to the lists. Otherwise max and min compare them as text, and sum fails.
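With ~100,000,000 lines, keeping whole columns in Python lists may itself use a lot of memory. One possible variant, sketched here as an assumption rather than as what this answer proposes, reads the file in fixed-size chunks and lets NumPy reduce each chunk. The name chunked_stats and the default chunk_size are hypothetical; np.loadtxt is used only on small lists of lines, which it accepts in place of a filename.

```python
from itertools import islice

import numpy as np


def chunked_stats(lines, chunk_size=1_000_000):
    """Reduce an iterable of lines chunk by chunk with NumPy.

    Tracks the sum and mean of column 2 and the min/max of
    columns 3 and 4, holding at most chunk_size lines at a time.
    """
    lines = iter(lines)
    count = 0
    total2 = 0.0
    mins = np.full(2, np.inf)    # running minima of columns 3 and 4
    maxs = np.full(2, -np.inf)   # running maxima of columns 3 and 4

    while True:
        raw = list(islice(lines, chunk_size))
        if not raw:
            break  # input exhausted
        rows = [ln for ln in raw if ln.strip()]  # drop blank lines
        if not rows:
            continue
        # usecols is 0-indexed: (1, 2, 3) selects columns 2, 3 and 4.
        data = np.loadtxt(rows, usecols=(1, 2, 3), ndmin=2)
        count += data.shape[0]
        total2 += float(data[:, 0].sum())
        mins = np.minimum(mins, data[:, 1:].min(axis=0))
        maxs = np.maximum(maxs, data[:, 1:].max(axis=0))

    return {
        "count": count,
        "sum2": total2,
        "mean2": total2 / count if count else float("nan"),
        "min3": float(mins[0]), "max3": float(maxs[0]),
        "min4": float(mins[1]), "max4": float(maxs[1]),
    }
```

As with any file-based iterable, an open file object can be passed directly as lines. Whether this is faster than a plain Python loop depends on the chunk size and on how much time is spent parsing, so it is worth timing both on a sample of the real data.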