
AWK vs MySQL for Data Aggregation

While trying to figure out whether AWK or MySQL is more efficient for processing log files and returning aggregate stats, I noticed the following behavior, which doesn't make sense to me:

To test this I used a file with 4 columns and approximately 9 million records. Both tests ran on the same server, a VPS with an SSD and 1 GB of RAM.

column1 has about 10 unique values, and the combination of all four columns has approximately 4,000 unique values.

In MySQL I use a table log_test (column1, column2, column3, column4) with no indices.

Data Format:

column1,column2,column3,column4
column1,column2,column3,column4

AWK Script:

BEGIN {
    FS = ",";
    time = systime();   # record start time
}
{
    array[$1]++;                        # first test
    #array[$1 "," $2 "," $3 "," $4]++;  # second test
}
END {
    for (value in array) {
        print "array[" value "]=" array[value];
    }
}

MySQL Query:

Query 1: SELECT column1, count(*) FROM log_test GROUP BY column1;

Query 2: SELECT column1, column2, column3, column4, count(*) 
FROM log_test GROUP BY column1, column2, column3, column4;

As expected, AWK is slower than MySQL. When I run the first test, which returns aggregate data with 10 output lines, MySQL takes around 7 seconds to finish and AWK takes around 22 seconds.

I understand that AWK reads and processes the data line by line, so I would expect the second test, which produces about 4,000 output lines, to take roughly the same time as the first, since it still has the same number of lines to read and isn't doing much more processing. However, AWK takes about 90 seconds while using only 0.1% of memory, whereas MySQL takes about 45 seconds and uses 3% of memory.

  1. Why does AWK take so much longer for test 2 than test 1 when it is essentially reading the same file?
  2. Why does AWK not use more memory? Is it storing its values on the hard disk instead of in memory?
  3. Why is MySQL so much faster when it essentially has to read the table line by line as well?
  4. Are there more efficient alternatives to aggregating this data?

Awk has to store all of the tuples in the second case (and juggle a much larger associative map). To verify this, try the intermediate steps of 2- and 3-field counts.
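For example, the intermediate runs could look like the following sketch (uncomment exactly one of the counting lines per run and compare the timings):

BEGIN { FS = "," }
{
    #array[$1]++;                       # 1 field  (first test)
    array[$1 "," $2]++;                 # 2 fields
    #array[$1 "," $2 "," $3]++;         # 3 fields
    #array[$1 "," $2 "," $3 "," $4]++;  # 4 fields (second test)
}
END { for (value in array) print "array[" value "]=" array[value] }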

As for memory usage, can you look at the exact number of bytes used by the process? Force awk to sleep at the end and measure the memory usage (in bytes) in both cases, and you will see a difference.
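One way to do that (a sketch; the sleep duration and the ps invocation are only examples) is to pause in the END block and inspect the process from another shell:

END {
    for (value in array) print "array[" value "]=" array[value];
    # Keep the process alive so its memory use can be checked from another
    # terminal, e.g.:  ps -o rss=,vsz= -p <awk-pid>   (sizes are in KiB)
    system("sleep 300");
}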

MySQL stores the numerical data in a more efficient way than merely printing text. More importantly, it is probably storing the data in a pre-parsed form whereas awk has to perform an expensive record and field split on every line (you didn't show the MySQL setup, but if you used char(10) or other fixed width fields MySQL doesn't have to re-process the data).

The most efficient way would be to pre-sort, or to apply an index that is maintained as you update, although that comes at the cost of per-insert time. Alternatively, if the columns are small and have known widths, you can write your own C utility that takes advantage of those assumptions (the file would merely be a set of structs).
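To illustrate the pre-sort idea without the C utility: if the file is sorted first, identical lines arrive consecutively and the count needs almost no memory. A minimal sketch (the file and script names are just placeholders), run as sort data.csv | awk -f count_sorted.awk:

# Count runs of identical lines in pre-sorted input; only the previous
# line and a counter are kept in memory.
$0 != prev {
    if (NR > 1) print prev "=" count;
    prev = $0;
    count = 0;
}
{ count++ }
END { if (NR > 0) print prev "=" count }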

In both cases field splitting needs to take place, and you are correct that the difference in that part of the processing is negligible.

However you need to take into consideration the method of how Awk implements associative arrays. In order to increment a given array entry, it needs to construct the string used as the index, and then find that entry in the list of possible indices.

I infer from the problem statement that in the case of:

array[$1]++

the input data has 10 distinct values for $1, each of which is less than 20 characters (as indicated by the MySQL table spec). Constructing the index entails a copy of up to 20 characters from the input record. For each of the 9 million input records, at most 10 strings of fewer than 20 characters each need to be compared against the first field to determine which entry of "array" to increment.

But in the case of:

array[$1 "," $2 "," $3 "," $4]++

we need to copy up to 80 characters from the input record into the temporary memory where the index is assembled, whereas in the first case we only needed to copy about 20 characters.
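To make that construction explicit, the increment can be written with the key hoisted into a temporary variable (behaviour is identical; the sketch only shows the string that gets assembled on every record):

BEGIN { FS = "," }
{
    # Roughly 80 bytes (four fields of up to 20 characters plus the ","
    # separators) are copied into "key" for each of the ~9 million records.
    key = $1 "," $2 "," $3 "," $4;
    array[key]++;
}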

You stated that the output will have 4,000+ lines, which means that towards the end of the 9 million records each potential increment must search and compare against up to 4,000 strings of up to 80 characters each.

I don't know the gory details of how Awk indexes/hashes the associative array indices (I would hope it is more efficient than a straight search-and-compare iteration), but you can see that searching among 10 entries vs 4,000 entries can have the impact observed.

Note also that the length of the input fields affects AWK's processing time: copying a 20-character field takes roughly four times as long as copying a 5-character one.

Finally, note that when comparing AWK to MySQL you must also take into account the time needed to load the data into the MySQL database. If the data will be loaded regardless of whether AWK or MySQL is used for aggregation, then you will probably be better off letting MySQL do the aggregation.

But if you need to load the data into the MySQL database only so that it can be aggregated, then the load time must be added to the query time, and I think the end results will be a lot closer.

Consider that large text files can often be compressed by about 8:1, that SQL does not store data as ASCII text (it uses more compact, compressed storage), and that decompressing is much faster than reading from disk (you can see that in your low processor activity).

If SQL can search the compressed data directly, there is much less work involved. Indexing and other pre-work done by SQL makes its searches faster as well.
