How to get statistics on a large text file of data

I have a collection of large (~100,000,000 line) text files in the format:

    0.088293 1.3218e-32 2.886e-07 2.378e-02        21617        28702
    0.111662 1.1543e-32 3.649e-07 1.942e-02        93804        95906
    0.137970 1.2489e-32 4.509e-07 1.917e-02        89732        99938
    0.149389 8.0725e-32 4.882e-07 2.039e-02        71615        69733
    ...

I'd like to find the mean and sum of column 2, the maximum and minimum values of columns 3 and 4, and the total number of lines. How can I do this efficiently using NumPy? Because the files are so large, loadtxt and genfromtxt are impractical: they try to read the whole file into memory and take a long time to execute. In contrast, Unix tools like awk:

    awk '{ total += $2 } END { print total/NR }' <filename>

work in a reasonable amount of time. Can Python/NumPy do the job of awk for such big files?

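You can do all of this in a single awk pass. The example below keeps a running sum of column 2 and tracks the minimum and maximum of columns 2 and 3; for columns 3 and 4, change the loop bounds to i=3;i<=4 and adjust the indices in the print statements: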

    awk '{  total2 += $2                 # running sum of column 2
            for (i=2;i<=3;i++) {         # track min/max for columns 2 and 3
                # length() is 0 until the first value has been stored
                max[i]=(length(max[i]) && max[i]>$i)?max[i]:$i
                min[i]=(length(min[i]) && min[i]<$i)?min[i]:$i
            }
         } END {
               print "items", "average2", "min2", "min3", "max2", "max3"
               print NR, total2/NR, min[2], min[3], max[2], max[3]
         }' file

Test

With your given input:

    $ awk '{total2 += $2; for (i=2;i<=3;i++) {max[i]=(length(max[i]) && max[i]>$i)?max[i]:$i; min[i]=((length(min[i]) && min[i]<$i)?min[i]:$i)}} END {print "items", "average2", "min2", "min3", "max2", "max3"; print NR, total2/NR, min[2], min[3], max[2], max[3]}' a | column -t
    items  average2     min2        min3       max2        max3
    4      2.94938e-32  1.1543e-32  2.886e-07  8.0725e-32  4.882e-07
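
For comparison, the same single-pass idea can be sketched in plain Python. This is only a minimal sketch, not a tuned implementation: the function name column_stats and the keys of the returned dict are made up for illustration. It assumes whitespace-separated numeric columns and follows the question in tracking columns 3 and 4. Unlike awk's NR, it skips blank lines instead of counting them.

```python
import math


def column_stats(lines):
    """Single pass over whitespace-separated rows; columns are 1-indexed as in awk."""
    count = 0
    total2 = 0.0
    lo = {3: math.inf, 4: math.inf}    # running minimum per column
    hi = {3: -math.inf, 4: -math.inf}  # running maximum per column
    for line in lines:
        fields = line.split()
        if not fields:                 # skip blank lines
            continue
        count += 1
        total2 += float(fields[1])     # column 2
        for col in (3, 4):
            value = float(fields[col - 1])
            if value < lo[col]:
                lo[col] = value
            if value > hi[col]:
                hi[col] = value
    return {
        "count": count,
        "sum2": total2,
        "mean2": total2 / count if count else math.nan,
        "min3": lo[3], "max3": hi[3],
        "min4": lo[4], "max4": hi[4],
    }
```

Because a file object yields one line at a time, passing an open file to column_stats (for example inside a with open(...) block) keeps memory use constant regardless of file size.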

Loop through the lines and extract the fields you need, appending each value to an initially empty list for each column you want. A regex works, but str.split() is enough for whitespace-separated columns.

Once you have a column in list form you can apply max(list) and min(list) to it, and compute the mean as sum(list) / len(list) (Python has no built-in avg function).

Note: convert the values from str to float before appending them to the list, so that max, min and sum operate on numbers rather than on strings.
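
With ~100,000,000 lines, keeping a full list per column may not fit in memory. One possible variation, sketched below under some assumptions, is to collect lines in bounded chunks and let NumPy reduce each chunk. The function name chunked_stats, the chunk_size default, and the keys of the returned dict are illustrative choices rather than part of any library. It relies on np.loadtxt accepting a list of strings, and again tracks columns 3 and 4 as the question asks.

```python
from itertools import islice

import numpy as np


def chunked_stats(lines, chunk_size=1_000_000):
    """Reduce a large text table chunk by chunk, holding one chunk in memory at a time."""
    lines = iter(lines)
    count = 0
    total2 = 0.0
    mins = np.full(2, np.inf)    # running minima of columns 3 and 4
    maxs = np.full(2, -np.inf)   # running maxima of columns 3 and 4
    while True:
        # Drop blank lines so that every chunk holds only data rows.
        chunk = [ln for ln in islice(lines, chunk_size) if ln.strip()]
        if not chunk:
            # islice returns nothing once the input is exhausted; this sketch
            # also stops early if an entire chunk is blank.
            break
        # usecols is 0-indexed: (1, 2, 3) selects columns 2, 3 and 4.
        data = np.loadtxt(chunk, usecols=(1, 2, 3), ndmin=2)
        count += data.shape[0]
        total2 += float(data[:, 0].sum())
        mins = np.minimum(mins, data[:, 1:].min(axis=0))
        maxs = np.maximum(maxs, data[:, 1:].max(axis=0))
    return {
        "count": count,
        "sum2": total2,
        "mean2": total2 / count if count else float("nan"),
        "min3": float(mins[0]), "max3": float(maxs[0]),
        "min4": float(mins[1]), "max4": float(maxs[1]),
    }
```

As with the list approach, an open file object can be passed as lines. The chunk size trades memory for fewer NumPy calls, and the best value would need to be measured on the actual files.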
