
REGEXP_EXTRACT with URL in Hive

I want to extract a word between '/bla-bla-bla/' and 'a12345' in the URL, which is "this-is-the-word" using regexp_extract in Hive.

INPUT: www.website.com/bla-bla-bla/this-is-the-word.a12345.anotherword.blabla

DESIRED OUTPUT: this-is-the-word

I've tried the two expressions below, but neither worked. What regex will produce my desired output from this input?

regexp_extract(URL,'^.*[/]bla[-]bla[-]bla[/]([a-z]+)\\.(a([0-9]+))*$',1)
regexp_extract(URL,'^.*[/]bla-bla-bla[/]([a-z]*)[.]a([0-9]+)*$',1)

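Both attempts fail for the same two reasons: [a-z] does not match the hyphens in this-is-the-word, and the pattern ends with $ right after the digits even though the URL continues with .anotherword.blabla. You may use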

regexp_extract(URL,'^.*/bla-bla-bla/([^/.]+)\\.a[0-9].*$', 1)


It matches

  • ^ - start of string
  • .* - any 0+ chars other than line break chars, as many as possible
  • /bla-bla-bla/ - a literal /bla-bla-bla/ substring
  • ([^/.]+) - Group 1 (the value returned, since the last argument is 1): 1 or more chars other than / and .
  • \\.a - a literal .a substring (the backslash is doubled because it sits inside a Hive string literal)
  • [0-9] - a digit
  • .*$ - the rest of the string to its end.
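
To try the pattern outside a table, here is a minimal sketch that runs it against the sample URL as a string literal. It assumes a Hive version that allows SELECT without a FROM clause; the column alias extracted_word is only an illustrative name.

```sql
-- Sketch: apply the pattern to the sample URL from the question.
-- The backslash before the dot is doubled because Hive string
-- literals consume one level of escaping before the regex engine
-- sees the pattern.
SELECT regexp_extract(
  'www.website.com/bla-bla-bla/this-is-the-word.a12345.anotherword.blabla',
  '^.*/bla-bla-bla/([^/.]+)\\.a[0-9].*$',
  1
) AS extracted_word;
-- expected result: this-is-the-word
```

If the pattern does not match, regexp_extract returns an empty string, so an empty result usually points at the regex rather than at the data.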

