
Why do the OLAP and TSDB storage engines compress the same data so differently?

I imported a 460 MB CSV file into both an OLAP database and a TSDB database using the same schema. The OLAP database takes up 107 MB on disk, but the TSDB database takes up an astonishing 1.5 GB. Why does this happen? Is there any way to improve the compression ratio of the TSDB engine?

Here is my script for table creation:

// specify table schema
tbSchema=extractTextSchema(fileDir + fileNames["filename"][0], "\t")
update tbSchema set type="SYMBOL" where name in `datasetCode`reporterCode`partnerCode`partner2Code
update tbSchema set type="BOOL" where name in `isOriginalClassification`isQtyEstimated`isAltQtyEstimated`isNetWgtEstimated`isGrossWgtEstimated`isReported`isAggregate
update tbSchema set type="CHAR" where name in `legacyEstimationFlag
update tbSchema set type="SHORT" where name in `refYear`refMonth`period`mosCode`motCode

// load the data into memory, then create the DFS databases
tmpTB=loadText(filename=dataFilePath, delimiter="\t", schema=tbSchema)
// TSDB
dbTSDB=database("dfs://comTradeTSDB", VALUE, `01`02, , engine="TSDB", atomic="TRANS")
ptTSDB=dbTSDB.createPartitionedTable(tmpTB, `commodity, `cmdCode, , sortColumns=`reporterCode`partnerCode`flowCode`refYear`refMonth`period`freqCode`refPeriodId, keepDuplicates=LAST)
ptTSDB.append!(tmpTB)
flushTSDBCache()
// OLAP
db=database("dfs://comTrade", VALUE, `01`02, , engine="OLAP", atomic="TRANS")
pt=db.createPartitionedTable(tmpTB, `commodity, `cmdCode)
pt.append!(tmpTB)
flushOLAPCache()

The cause is the choice of sort columns, which is too fine-grained:

sortColumns=`reporterCode`partnerCode`flowCode`refYear`refMonth`period`freqCode`refPeriodId

In the TSDB engine, each distinct combination of values in the sort key columns is a separate sort key, and for every sort key the engine stores metadata for each column, such as the sum, max, min, count, and notnullcount. If each sort key covers only a few rows, the amount of metadata becomes large relative to the data itself, and the table expands on disk instead of shrinking.
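One way to check whether this applies to your data is to count how many rows fall under each sort key. The sketch below reuses `tmpTB` from the question and assumes that the sort key consists of every entry in sortColumns except the last one (refPeriodId), counted per cmdCode partition. The variable names are only illustrative.

```dolphindb
// Sketch only: assumes tmpTB from the question is still in memory.
// Assumption: sort key = all sortColumns except the last one (refPeriodId).
// cmdCode is added to the grouping so that keys are counted per partition.
keyStats = select count(*) as rowsPerKey from tmpTB group by cmdCode, reporterCode, partnerCode, flowCode, refYear, refMonth, period, freqCode

totalRows = tmpTB.size()        // number of rows in the imported table
distinctKeys = keyStats.size()  // number of (partition, sort key) combinations

print("total rows: " + string(totalRows))
print("distinct sort keys: " + string(distinctKeys))
// "\" is floating-point division in DolphinDB
print("average rows per sort key: " + string(totalRows \ distinctKeys))
```

If the average comes out close to 1, almost every row has its own sort key, which matches the expansion described above.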

To address this issue:

  • Use frequently queried entity IDs (e.g. stock IDs) as sort keys.
  • Avoid using the primary key of a relational table as the sort key.

For the table in your question, use reporterCode and refPeriodId as the sort columns.
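A minimal sketch of that change, reusing `tmpTB` and `dbTSDB` from the question, might look like the following. The table name `commodityBySortKey` is hypothetical and is only there to avoid clashing with the existing `commodity` table. Note that keepDuplicates is set to ALL here as an assumption on my part: deduplication is based on the sort columns, so keeping LAST with only two sort columns would drop every row that shares a reporterCode and refPeriodId with another row. The disk usage check at the end uses getTabletsMeta and needs to be run on a data node.

```dolphindb
// Sketch only: reuses tmpTB and dbTSDB from the question.
// "commodityBySortKey" is a hypothetical table name used for illustration.
// keepDuplicates=ALL is an assumption: with LAST, rows sharing the same
// reporterCode and refPeriodId would be collapsed into one.
ptTSDB2 = dbTSDB.createPartitionedTable(tmpTB, `commodityBySortKey, `cmdCode, , sortColumns=`reporterCode`refPeriodId, keepDuplicates=ALL)
ptTSDB2.append!(tmpTB)
flushTSDBCache()

// Compare on-disk size of the old and new tables (run on a data node).
// The third argument asks getTabletsMeta to include the diskUsage column.
oldMeta = getTabletsMeta("/comTradeTSDB/%", `commodity, true)
newMeta = getTabletsMeta("/comTradeTSDB/%", `commodityBySortKey, true)
select sum(diskUsage) as totalDiskUsage from oldMeta
select sum(diskUsage) as totalDiskUsage from newMeta
```

If you do need deduplication, check whether reporterCode plus refPeriodId really identifies a row before switching keepDuplicates back to LAST.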


 