
Filter only elements that match a regex in Athena

I have a table column of string type that holds comma-separated values.

Sample Input: 'ASEIAW,1245555,asda2dd,TPOIBV'
Expected output: ['ASEIAW,TPOIBV'] - An array with all matching elements, where a matching element consists of exactly 6 upper-case letters.

What I tried:

select REGEXP_EXTRACT('ASEIOW,ASDWQB,TPOIBV,2' , '(\b[A-Z]{6,6}\b)+');
Output: ASEIOW

select REGEXP_LIKE('ASEIOW,ASDWQB,TPOIBV,2' , '(\b[A-Z]{6,6}\b)+');
Output: [v]

select REGEXP_SPLIT('ASEIOW,ASDWQB,TPOIBV,2' , '(\b[A-Z]{6,6}\b)+');
Output: ['',',',',',',2']

Using ^ (intended as NOT) in front of the regex:

select REGEXP_SPLIT('ASEIOW,ASDWQB,TPOIBV,2' , '^(\b[A-Z]{6,6}\b)+');
Output: ['',',ASDWQB,TPOIBV,2']

select REGEXP_REPLACE('ASEIOW,ASDWQB,TPOIBV,2' , '^(\b[A-Z]{6,6}\b)+');
Output: ,ASDWQB,TPOIBV,2

You can use REGEXP_EXTRACT_ALL with a slightly simplified regex:

REGEXP_EXTRACT_ALL('ASEIOW,ASDWQB,TPOIBV,2' , '\b([A-Z]{6})\b');

The REGEXP_EXTRACT_ALL function extracts every occurrence of the pattern, and \b([A-Z]{6})\b simply matches six upper-case letters enclosed by word boundaries. There is no need to give identical min and max values in the range quantifier ({6} is equivalent to {6,6}), nor to quantify the group again with +.
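
To see what the word boundaries contribute, here is a small sketch comparing the pattern with and without \b. The input string is made up for illustration (it is not from the question), and the capturing group is dropped because it is not needed when the whole match is returned.

```sql
-- Hypothetical input: 'ABCDEFGH' is an eight-letter token added for illustration.
-- Without \b, the first six letters of the longer token also match.
SELECT regexp_extract_all('ABCDEFGH,TPOIBV', '[A-Z]{6}');      -- [ABCDEF, TPOIBV]

-- With \b on both sides, only the stand-alone six-letter token matches.
SELECT regexp_extract_all('ABCDEFGH,TPOIBV', '\b[A-Z]{6}\b');  -- [TPOIBV]
```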

You can use REGEXP_EXTRACT_ALL:

select regexp_extract_all('ASEIOW,ASDWQB,TPOIBV,2', '\b([A-Z]{6})\b');

Output:

_col0
[ASEIOW, ASDWQB, TPOIBV]
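
One caveat: \b treats only letters, digits and underscore as word characters, so it can also match a six-letter run inside an element that contains, say, a hyphen. If that matters, a stricter alternative is to split on commas and keep only the elements that match in full. This is a sketch assuming Athena's Presto/Trino functions split, filter and regexp_like; my_table and codes in the second query are hypothetical names.

```sql
-- Split on commas, then keep elements made of exactly six upper-case letters.
SELECT filter(
         split('ASEIAW,1245555,asda2dd,TPOIBV', ','),
         x -> regexp_like(x, '^[A-Z]{6}$')
       ) AS result;                                   -- [ASEIAW, TPOIBV]

-- Same idea against a table; my_table and codes are hypothetical names.
SELECT codes,
       filter(split(codes, ','), x -> regexp_like(x, '^[A-Z]{6}$')) AS result
FROM my_table;
```

Anchoring with ^ and $ inside the lambda checks each element as a whole, which is why no word boundaries are needed here.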

You can use nested regexp_replace calls like this:

SELECT sentences,
       REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(sentences, '.[[:lower:]0-9]'), '^', '['''), '$', ''']') AS result
FROM (
    SELECT 'ASEIAW,1245555,asda2dd,TPOIBV' AS sentences FROM dual
)
