Differences in compression between OLAP and TSDB databases

Question

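A 460 MB CSV file was imported into an OLAP database and a TSDB database with the same schema. The OLAP database is 107 MB, while the TSDB database is a surprising 1.5 GB. Why is that? Is there any way to improve the compression ratio of the TSDB engine?

This is my table creation script: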

// specify table schema
tbSchema=extractTextSchema(fileDir + fileNames["filename"][0], "\t")
update tbSchema set type="SYMBOL" where name in `datasetCode`reporterCode`partnerCode`partner2Code
update tbSchema set type="BOOL" where name in `isOriginalClassification`isQtyEstimated`isAltQtyEstimated`isNetWgtEstimated`isGrossWgtEstimated`isReported`isAggregate
update tbSchema set type="CHAR" where name in `legacyEstimationFlag
update tbSchema set type="SHORT" where name in `refYear`refMonth`period`mosCode`motCode

// create DFS databases
tmpTB=loadText(filename=dataFilePath, delimiter="\t", schema=tbSchema)
// TSDB
dbTSDB=database("dfs://comTradeTSDB", VALUE, `01`02, , engine="TSDB", atomic="TRANS")
ptTSDB=dbTSDB.createPartitionedTable(tmpTB, `commodity, `cmdCode, , sortColumns=`reporterCode`partnerCode`flowCode`refYear`refMonth`period`freqCode`refPeriodId, keepDuplicates=LAST)
ptTSDB.append!(tmpTB)
flushTSDBCache()
// OLAP
db=database("dfs://comTrade", VALUE, `01`02, , engine="OLAP", atomic="TRANS")
pt=db.createPartitionedTable(tmpTB, `commodity, `cmdCode)
pt.append!(tmpTB)
flushOLAPCache()

Answer 1

The cause of your problem is an unsuitable choice of sort columns:

sortColumns=`reporterCode`partnerCode`flowCode`refYear`refMonth`period`freqCode`refPeriodId

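For every distinct combination of sort key values, metadata such as sum, max, count, min, and notnullcount is generated for each column. So when each sort key holds only a few rows, a large amount of metadata is produced relative to the data itself, and the stored size inflates.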
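One way to check whether this applies to a dataset is to count how many distinct sort key combinations exist and how many rows each one holds. The following is a minimal diagnostic sketch, not part of the original answer. It assumes that tmpTB from the question is still in memory and that the sort key consists of every sortColumns entry except the last one (refPeriodId); keyStats and rowsPerKey are names introduced here for illustration.

```
// Diagnostic sketch (assumption: tmpTB is the in-memory table loaded in the question).
// Group by the assumed sort key: all sortColumns except the last one (refPeriodId).
keyStats = select count(*) as rowsPerKey from tmpTB group by reporterCode, partnerCode, flowCode, refYear, refMonth, period, freqCode

// Number of distinct sort key combinations
rows(keyStats)

// Average number of rows stored under each sort key
avg(keyStats.rowsPerKey)
```

If the number of combinations is close to the number of rows in tmpTB, each sort key holds very little data, which is the situation described above.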

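To resolve this issue:

Use a frequently queried entity ID (for example, a stock ID) as the sort key.
Avoid using what would be the primary key in a relational database as the sort key.

In the specific scenario above, use reporterCode and refPeriodId as the sort columns.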
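Applied to the script in the question, the suggestion might look like the sketch below. The database path "dfs://comTradeTSDB2" and the variable names dbTSDB2 and ptTSDB2 are hypothetical, chosen so the original database is left untouched. Note one assumption that goes beyond the answer: the sketch sets keepDuplicates=ALL instead of LAST, because with only two sort columns, LAST would likely treat rows that share reporterCode and refPeriodId as duplicates and keep only one of them.

```
// Sketch of the suggested fix; "dfs://comTradeTSDB2" is a hypothetical path.
dbTSDB2=database("dfs://comTradeTSDB2", VALUE, `01`02, , engine="TSDB", atomic="TRANS")

// Only reporterCode acts as the sort key; refPeriodId is the last sort column.
// keepDuplicates=ALL is an assumption made here to avoid dropping rows that
// differ only in columns no longer listed in sortColumns.
ptTSDB2=dbTSDB2.createPartitionedTable(tmpTB, `commodity, `cmdCode, , sortColumns=`reporterCode`refPeriodId, keepDuplicates=ALL)

ptTSDB2.append!(tmpTB)
flushTSDBCache()
```

After the flush, comparing the on-disk size of this database with the original TSDB database should show whether the reduced sort key narrows the gap with the OLAP result.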
