
Can I use regular expression in PARTITION BY?

  (
      ResponseRgBasketId          STRING,
      RawStandardisedLoadDateTime TIMESTAMP,
      InfoMartLoadDateTime        TIMESTAMP,
      Operaame               STRING,
      RequestTimestamp            TIMESTAMP,
      RequestSiteId               STRING,
      RequestSalePointId          STRING,
      RequestdTypeId       STRING,
      RequeetValue          DECIMAL(10,2),
      ResponsegTimestamp TIMESTAMP,
      RequessageId            STRING,
      RequestBasketId             STRING,
      ResponsesageId           STRING,
      RequestTransmitAttempt      INT,
      ResponseCode                STRING,
      RequestasketItems    INT,
      ResponseFinancialTimestamp  TIMESTAMP,
      RequeketJsonString     STRING,
      LoyaltyId                   STRING
  )
  USING DELTA
  PARTITIONED BY (RequestTimestamp)
  TBLPROPERTIES
  (
      delta.deletedFileRetentionDuration = "interval 1 seconds",
      delta.autoOptimize.optimizeWrite = true
  )

The table is partitioned by RequestTimestamp, so each partition value is a full timestamp such as 2020-12-12T07:39:35.000+0000. Can I change the format used in PARTITIONED BY to a date-only value, something like 2020-12-12?

Short answer: No regexp or other transformation is possible in PARTITIONED BY. The only solution is to apply substr(timestamp, 1, 10) during/before load. See also this answer: https://stackoverflow.com/a/64171676/2700344
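A minimal sketch of what "apply substr during load" could look like, written in the same Hive style as the test further down. The table names requests_by_day and requests_staging and the partition column RequestDate are hypothetical, and only two of the question's columns are shown to keep it short:

```sql
-- Hypothetical target table: partitioned by a date-only string column
-- instead of the full timestamp.
CREATE TABLE IF NOT EXISTS requests_by_day (
    RequestTimestamp TIMESTAMP,
    RequestSiteId    STRING
)
PARTITIONED BY (RequestDate STRING);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The transformation happens in the SELECT, not in the DDL.
-- requests_staging is an assumed source table with the same columns.
INSERT OVERWRITE TABLE requests_by_day PARTITION (RequestDate)
SELECT
    RequestTimestamp,
    RequestSiteId,
    substr(cast(RequestTimestamp AS STRING), 1, 10) AS RequestDate  -- keeps yyyy-MM-dd
FROM requests_staging;
```

The full timestamp stays available as an ordinary column, while the partition key carries only the date part.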

Long answer:

No regexp is possible in PARTITIONED BY. No functions are allowed in table DDL; only a type can be specified. The type in a column specification works as a constraint and can also cause implicit type conversion. For example, if you load strings into a DATE column, they are cast implicitly where possible, and rows that cannot be cast are loaded into the NULL default partition. Likewise, if you load BIGINT values into an INT column, they are silently truncated to INT, and as a result you will see corrupted data and duplicates.

Does the same implicit cast work with PARTITIONED BY? Let's see:

DROP TABLE IF EXISTS test_partition;
CREATE TABLE IF NOT EXISTS test_partition (Id   int)
    partitioned by (dt date) --Hope timestamp will be truncated to DATE
;

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table test_partition partition(dt)
select 1 as id, current_timestamp as dt;

show partitions test_partition;

Result (We expect timestamp truncated to DATE...):

dt=2021-03-24 10%3A19%3A19.985

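No, it does not work: the partition value keeps the full timestamp (with each ":" escaped as %3A) instead of being truncated to a date. I tested the same with a varchar(10) partition column and strings like yours. See the short answer.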
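For comparison, here is a sketch of the same test with the conversion done explicitly in the SELECT, as the short answer recommends. It reuses the test_partition table and the dynamic-partition settings from above; the only change is the cast:

```sql
-- Same test as above, but the timestamp is converted to DATE
-- explicitly before it reaches the partition column.
insert overwrite table test_partition partition(dt)
select 1 as id, cast(current_timestamp as date) as dt;

show partitions test_partition;
```

With the explicit cast, the partition should be named by the date alone, in the form dt=yyyy-MM-dd.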
