Why is the dictionary page offset 0 for `plain_dictionary` encoding?

Question

The Parquet file was generated by Spark v2.4 with parquet-mr v1.10:

import numpy as np
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

# Assumes an existing SparkSession named `spark`.
n = 10000
x = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, None] * n
y = [u'é', u'é', u'é', u'é', u'a', None, u'a'] * n
z = np.random.rand(len(x)).tolist()

schema = StructType([StructField('x', DoubleType(), True),
                     StructField('y', StringType(), True),
                     StructField('z', DoubleType(), False)])
dfs = spark.createDataFrame(list(zip(x, y, z)), schema=schema)
dfs.repartition(1).write.mode('overwrite').parquet('test_spark.parquet')

Inspecting it with parquet-tools v1.12:

row group 0 
--------------------------------------------------------------------------------
x:  DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000]
y:  BINARY SNAPPY DO:0 FPO:1636 SZ:864/16573/19.18 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000]
z:  DOUBLE SNAPPY DO:0 FPO:2500 SZ:560097/560067/1.00 VC:70000 ENC:PLAIN,BIT_PACKED ST:[min: 2.0828331581679294E-7, max: 0.9999892375625329, num_nulls: 0]

    x TV=70000 RL=0 DL=1 DS: 5 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000] SZ:31514 VC:70000

    y TV=70000 RL=0 DL=1 DS: 2 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000] SZ:16514 VC:70000

    z TV=70000 RL=0 DL=0
    ----------------------------------------------------------------------------
    page 0:                   DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[min: 2.0828331581679294E-7, max: 0.9999892375625329, num_nulls: 0] SZ:560000 VC:70000

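Questions:

Should FPO (first data page offset) always be greater than or less than DO (dictionary page offset)? I read somewhere that the dictionary page is stored after the data pages.

Columns x and y are encoded with plain_dictionary. Why, then, is the dictionary offset 0 for both columns?

If I inspect the file with pyarrow v0.11.1, which uses parquet-cpp v1.5.1, it reports has_dictionary_page: False and dictionary_page_offset: None.

Does the file have a dictionary page or not?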

Answer 1

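The offset of the first data page is always larger than the offset of the dictionary. In other words, the dictionary comes first and the data pages follow. Two metadata fields are meant to store these offsets: dictionary_page_offset (aka DO) and data_page_offset (aka FPO). Unfortunately, parquet-mr does not fill in these metadata fields correctly.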

For example, if the dictionary starts at offset 1000 and the first data page starts at offset 2000, the correct values would be:

dictionary_page_offset = 1000
data_page_offset = 2000

Instead, parquet-mr stores:

dictionary_page_offset = 0
data_page_offset = 1000

Applied to your example, this means that even though parquet-tools shows DO: 0, columns x and y are still dictionary encoded (column z is not).
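Since the DO value cannot be trusted here, one way to tell whether a chunk is dictionary encoded is to look at its encodings list instead. The following is a minimal sketch of that idea, not part of any Parquet library: the helper name uses_dictionary and the DICTIONARY_ENCODINGS set are hypothetical, and the encodings are copied by hand from the ENC: fields of the parquet-tools output in the question.

```python
# Hypothetical helper, not a library API: decide from the ENC list alone.
# PLAIN_DICTIONARY and RLE_DICTIONARY are the Parquet dictionary encodings.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


def uses_dictionary(encodings):
    """Return True if any encoding listed for the chunk is a dictionary encoding."""
    return any(enc in DICTIONARY_ENCODINGS for enc in encodings)


# ENC values copied from the parquet-tools output in the question.
chunks = {
    "x": ["RLE", "BIT_PACKED", "PLAIN_DICTIONARY"],
    "y": ["RLE", "BIT_PACKED", "PLAIN_DICTIONARY"],
    "z": ["PLAIN", "BIT_PACKED"],
}

for name, encodings in chunks.items():
    print(name, uses_dictionary(encodings))
```

This prints True for x and y and False for z, matching the conclusion above. Note that the check only says a dictionary encoding was used somewhere in the chunk; it says nothing about where the dictionary page is located.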

It is worth mentioning that Impala follows the specification correctly, so you cannot rely on every file having this deficiency.

This is how parquet-mr handles the situation when reading:

// TODO: this should use getDictionaryPageOffset() but it isn't reliable.
if (f.getPos() != meta.getStartingPos()) {
  f.seek(meta.getStartingPos());
}

where getStartingPos is defined as:

/**
 * @return the offset of the first byte in the chunk
 */
public long getStartingPos() {
  long dictionaryPageOffset = getDictionaryPageOffset();
  long firstDataPageOffset = getFirstDataPageOffset();
  if (dictionaryPageOffset > 0 && dictionaryPageOffset < firstDataPageOffset) {
    // if there's a dictionary and it's before the first data page, start from there
    return dictionaryPageOffset;
  }
  return firstDataPageOffset;
}
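To make the effect of this fallback concrete, here is a small Python port of the logic in getStartingPos, applied to the offsets discussed above. It is only an illustrative sketch: the function name starting_pos is made up for this example, and it takes the two offsets as plain integers rather than reading them from a real file.

```python
def starting_pos(dictionary_page_offset, first_data_page_offset):
    """Return the offset of the first byte of a column chunk.

    Mirrors the quoted getStartingPos: the dictionary offset is used only
    when it is positive and lies before the first data page.
    """
    if 0 < dictionary_page_offset < first_data_page_offset:
        return dictionary_page_offset
    return first_data_page_offset


# Spec-compliant metadata: dictionary at 1000, first data page at 2000.
print(starting_pos(1000, 2000))  # 1000

# parquet-mr style metadata for the same chunk: DO = 0, FPO = 1000.
print(starting_pos(0, 1000))  # 1000

# Column x from the question: DO:0 FPO:4.
print(starting_pos(0, 4))  # 4
```

In both styles the reader ends up seeking to the first byte of the chunk, which is why files with DO: 0 can still be read correctly.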

You can see these lines of code in context in ParquetFileReader.readDictionary and ColumnChunkMetaData.getStartingPos.

Answer 1: 2, accepted, 2019-03-18 17:12:58
