
indexing varchars without duplicating the data

I have a huge data set of ~1 billion records in the following format:

|KEY (varchar(300), UNIQUE, PK)|DATA1 (int)|DATA2 (bool)|DATA4 (varchar(10))|

Currently the data is stored in a MyISAM MySQL table, but the problem is that the key data (10 GB out of the 12 GB table size) is stored twice: once in the table and once in the index. (The data is append-only; there will never be an UPDATE query on the table.)

There are two major actions that run against the data set:

  1. contains - a simple check whether a key exists
  2. count - aggregation functions (mostly counts) over the data fields
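The two access patterns can be sketched with an in-memory SQLite table (a hypothetical stand-in for the MyISAM table; the table name `t` and the sample rows are made up for illustration):

```python
import sqlite3

# Hypothetical miniature of the table described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (key TEXT PRIMARY KEY, data1 INTEGER, data2 INTEGER, data4 TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [("thesimon_wrote_this", 1, 1, "abc"),
     ("another_key", 2, 0, "xyz")],
)

# 1. contains: existence check on the key
exists = conn.execute(
    "SELECT 1 FROM t WHERE key = ? LIMIT 1", ("thesimon_wrote_this",)
).fetchone() is not None

# 2. count: aggregation over the data fields
n = conn.execute("SELECT COUNT(*) FROM t WHERE data2 = 1").fetchone()[0]

print(exists, n)
```

Both queries touch only the key index and the small fixed-width columns; the pain point in the question is that the key bytes back both the table row and the index entry.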

Is there a way to store the key data only once?

One idea I had is to drop the DB altogether and simply create a folder structure of 2-5 character segments. This way the data assigned to the key "thesimon_wrote_this" would be stored in the filesystem as

~/data/the/sim/on_/wro/te_/thi/s.data 

This way the data set would function much like a B-tree, and the "contains" and data-retrieval operations would run in almost O(1) (with the obvious HDD limitations).

This makes backups pretty easy (backing up only files with the archive attribute set), but the aggregation functions become almost useless, as I would need to grep a billion files every time. The allocation-unit size is irrelevant, as I can adjust the file structure so that no more than 5% of the disk space is wasted.
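The key-to-path mapping above can be sketched in a few lines (the function name `key_to_path`, the root directory, and the segment width of 3 are assumptions for illustration):

```python
import os

def key_to_path(key: str, root: str = "~/data", width: int = 3) -> str:
    """Map a key to a nested folder path by splitting it into
    fixed-width segments; the last segment becomes the file name."""
    chunks = [key[i:i + width] for i in range(0, len(key), width)]
    return os.path.join(root, *chunks[:-1], chunks[-1] + ".data")

print(key_to_path("thesimon_wrote_this"))
```

On a POSIX system this prints `~/data/the/sim/on_/wro/te_/thi/s.data`, matching the example above. Note that keys may contain characters that are not valid in file names (e.g. `/`), so a real implementation would need an escaping scheme.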

I'm pretty sure there is another, much more elegant way to do this, but I can't Google it out. :)

It would seem like a very good idea to consider a fixed-width, integral key, such as a 64-bit integer. Storing and searching a varchar key is very slow by comparison. You can still add a secondary index on the KEY column for fast lookups, but it shouldn't be your primary key.
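One common way to derive such a fixed-width key without storing the varchar twice is to hash it to 64 bits; this is a hypothetical sketch (truncated SHA-256, function name `key_to_id` assumed), not something the answer prescribes:

```python
import hashlib
import struct

def key_to_id(key: str) -> int:
    """Derive a 64-bit integer surrogate key from a varchar key by
    taking the first 8 bytes of its SHA-256 digest."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return struct.unpack("<Q", digest[:8])[0]
```

A caveat worth hedging on: with ~10^9 keys, the birthday bound for a 64-bit hash gives a collision probability on the order of n^2 / 2^65, a few percent, so inserts would still need a collision check (or the original key kept in a non-indexed column for verification).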
