
indexing varchars without duplicating the data

I have a huge data set of ~1 billion records in the following format:

|KEY (varchar(300), UNIQUE, PK)|DATA1 (int)|DATA2 (bool)|DATA4 (varchar(10))|

Currently the data is stored in a MyISAM MySQL table, but the problem is that the key data (10 GB out of the 12 GB table size) is stored twice: once in the table and once in the index. (The data is append-only; there will never be an UPDATE query on the table.)

There are two major actions that run against the data set:

  1. contains - a simple check whether a key exists
  2. count - aggregation functions (mostly counts) over the data fields
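The two access patterns can be sketched with an in-memory SQLite table (a hypothetical stand-in for the MyISAM table; the table name `t` and the sample rows are made up for illustration):

```python
import sqlite3

# Hypothetical miniature of the table described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (key TEXT PRIMARY KEY, data1 INTEGER, data2 INTEGER, data4 TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [("thesimon_wrote_this", 1, 1, "abc"),
     ("another_key", 2, 0, "xyz")],
)

# 1. contains: existence check on the key
exists = conn.execute(
    "SELECT 1 FROM t WHERE key = ? LIMIT 1", ("thesimon_wrote_this",)
).fetchone() is not None

# 2. count: aggregation over the data fields
n = conn.execute("SELECT COUNT(*) FROM t WHERE data2 = 1").fetchone()[0]

print(exists, n)
```

Both queries touch only the key index and the small fixed-width columns; the pain point in the question is that the key bytes back both the table row and the index entry.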

Is there a way to store the key data only once?

One idea I had is to drop the DB altogether and simply create a folder structure of 2-5 character segments. This way the data assigned to the key "thesimon_wrote_this" would be stored in the filesystem as

~/data/the/sim/on_/wro/te_/thi/s.data 

This way the data set would function much like a B-tree, and the "contains" and data-retrieval operations would run in almost O(1) (with the obvious HDD limitations).

This makes backups pretty easy (backing up only files with the archive attribute set), but the aggregation functions become almost useless, as I would need to grep a billion files every time. The allocation-unit size is irrelevant, as I can adjust the file structure so that no more than 5% of the disk space is wasted.
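The key-to-path mapping above can be sketched in a few lines (the function name `key_to_path`, the root directory, and the segment width of 3 are assumptions for illustration):

```python
import os

def key_to_path(key: str, root: str = "~/data", width: int = 3) -> str:
    """Map a key to a nested folder path by splitting it into
    fixed-width segments; the last segment becomes the file name."""
    chunks = [key[i:i + width] for i in range(0, len(key), width)]
    return os.path.join(root, *chunks[:-1], chunks[-1] + ".data")

print(key_to_path("thesimon_wrote_this"))
```

On a POSIX system this prints `~/data/the/sim/on_/wro/te_/thi/s.data`, matching the example above. Note that keys may contain characters that are not valid in file names (e.g. `/`), so a real implementation would need an escaping scheme.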

I'm pretty sure there is another, much more elegant way to do this, but I can't Google it out. :)

It would seem like a very good idea to consider a fixed-width, integral key, such as a 64-bit integer. Storing and searching a varchar key is very slow by comparison. You can still add a secondary index on the KEY column for fast lookups, but it shouldn't be your primary key.
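One common way to derive such a fixed-width key without storing the varchar twice is to hash it to 64 bits; this is a hypothetical sketch (truncated SHA-256, function name `key_to_id` assumed), not something the answer prescribes:

```python
import hashlib
import struct

def key_to_id(key: str) -> int:
    """Derive a 64-bit integer surrogate key from a varchar key by
    taking the first 8 bytes of its SHA-256 digest."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return struct.unpack("<Q", digest[:8])[0]
```

A caveat worth hedging on: with ~10^9 keys, the birthday bound for a 64-bit hash gives a collision probability on the order of n^2 / 2^65, a few percent, so inserts would still need a collision check (or the original key kept in a non-indexed column for verification).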
