
Hive: How to extract hashtags from a string

Suppose we have a sample field, say 'note', whose values could be ('xyz #abc-xyz-dfg. #abd', '#123%36&89', '## avc #1 "Changed xyz #2 from "string"\n - stop #3 #').

What I am doing is: REGEXP_EXTRACT(regexp_replace(upper(note), '.?(\\#\\w+)', ' $1'), '.(\\#\\w+)', 0)

But this does not give the results I need:

If a tag is #abc-xyz, it only fetches #abc instead of #abc-xyz.

If the text is #abc-xyz. #abcd, it only fetches #abcd instead of #abc-xyz #abcd.

If we have #1 or #2, it skips them and returns nulls.

If we have #123%36&89, it returns #123, but we need the complete tag.

\w is equivalent to [a-zA-Z0-9_], which doesn't contain -, %, or &. Instead of using \w, you should define a character set that contains everything that you want to match, such as [a-zA-Z0-9_%&-].

Note that - must be included as either the first or last character in the set in order for it to be treated as a literal (or it must be escaped). When it appears between two characters, it defines a range.
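Since Hive's regex functions follow Java regex syntax, the suggested character set can be tried out in plain Java before putting it into a query. The following is only a minimal sketch under that assumption: the class name HashtagDemo and the method extractTags are hypothetical, and the loop collects every match, whereas a single REGEXP_EXTRACT call returns one match only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagDemo {
    // '#' followed by one or more allowed characters.
    // '-' is placed last in the set so it is a literal, not a range.
    private static final Pattern TAG = Pattern.compile("#[a-zA-Z0-9_%&-]+");

    // Hypothetical helper: returns every hashtag found in the note, in order.
    static List<String> extractTags(String note) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(note);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Sample values taken from the question.
        System.out.println(extractTags("xyz #abc-xyz-dfg. #abd"));
        // prints [#abc-xyz-dfg, #abd]
        System.out.println(extractTags("#123%36&89"));
        // prints [#123%36&89]
        System.out.println(extractTags("## avc #1 \"Changed xyz #2 from \"string\"\n - stop #3 #"));
        // prints [#1, #2, #3]
    }
}
```

A bare # with no allowed character after it (as in '##' or the trailing '#') is not matched, because the set requires at least one character. Inside a Hive string literal the same pattern needs no extra escaping, since it contains no backslashes.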
