I am using the regexp_extract function in Impala to find a folder name in a file path, but it does not give me the correct result.
I want to parse out "one" from this file path:
/this/one/path/to/hdfs
This is the regex I used:
regexp_extract(filepath,'[/]+',0)
If we wish to capture the / here, we might just try ([\\/]+). Other expressions can extract one as well, such as:
(?:\/[a-z]+\/)(.+?)(?:\/.+)
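and our code might look like this (the index is 1 because non-capturing groups are not numbered, so (.+?) is the only capturing group):
regexp_extract(filepath, '(?:\/[a-z]+\/)(.+?)(?:\/.+)', 1)
or
regexp_extract(filepath, '(?:\/.+?\/)(.+?)(?:\/.+)', 1)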
In this case, we skip what comes before one using a non-capturing group:
(?:\/[a-z]+\/)
then we capture one using:
(.+?)
and finally we add a right boundary after one in another non-capturing group:
(?:\/.+)
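The group numbering described above can be checked outside Impala. The following is a minimal sketch in Python, assuming that Python's re module treats this particular pattern the same way RE2 does; the regexp_extract helper here is a hypothetical stand-in written for illustration, not Impala's implementation, and returning an empty string on no match is an assumption of the sketch.

```python
import re

def regexp_extract(text: str, pattern: str, index: int) -> str:
    """Hypothetical Python stand-in for regexp_extract (not Impala's implementation)."""
    match = re.search(pattern, text)
    # Assumption of this sketch: no match yields an empty string.
    return match.group(index) if match else ""

path = "/this/one/path/to/hdfs"
# The slashes need no escaping in a Python raw string.
pattern = r"(?:/[a-z]+/)(.+?)(?:/.+)"

whole_match = regexp_extract(path, pattern, 0)   # index 0 is the entire match
folder = regexp_extract(path, pattern, 1)        # index 1 is the only capturing group
group_count = re.compile(pattern).groups         # non-capturing groups are not counted

print(whole_match)
print(folder)
print(group_count)
```

Since the pattern has a single capturing group, asking for index 2 has nothing to return; in this Python sketch it would raise an IndexError.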
The expression can be visualized with jex.im.
Depending on which slash one follows, we can modify our expression. For example, in this case, this expression might also work:
(?:\/.+?\/)(.+?)(?:\/.+)
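To see why the looser first group can matter, here is a small sketch in Python comparing the two expressions on a hypothetical path whose first folder contains characters outside a-z. It assumes Python's re module behaves like RE2 for these patterns (both search for the leftmost match), and the sample path is made up for illustration.

```python
import re

STRICT = r"(?:/[a-z]+/)(.+?)(?:/.+)"   # first folder limited to lowercase letters
LOOSE = r"(?:/.+?/)(.+?)(?:/.+)"       # first folder may contain any characters

# Hypothetical path: the first folder contains an underscore and digits.
path = "/data_2019/one/path/to/hdfs"

strict_result = re.search(STRICT, path).group(1)
loose_result = re.search(LOOSE, path).group(1)

# STRICT cannot match at the start, so the search slides right and
# treats /one/ as the leading folder, capturing the folder after it.
print(strict_result)
print(loose_result)
```

Neither expression is anchored to the start of the string, so a pattern that fails on the first folder silently matches further to the right instead of returning nothing.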
The latest Impala versions use the RE2 regex library, and you can easily access capturing group values using the third argument of the regexp_extract function.
Use the following regex:
^/[^/]+/([^/]+)
See the regex demo (note that the Go regex flavor is also RE2, which is why that option is selected at regex101). It matches:
^ - start of string
/ - a / char (there are no regex delimiters in an Impala regex string, hence no need to escape / chars in the pattern)
[^/]+ - any 1 or more chars other than /
/ - a / char
([^/]+) - capturing group 1 (to get it, the index argument must be set to 1): any 1 or more chars other than /
Code:
regexp_extract(filepath, '^/[^/]+/([^/]+)', 1)
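As a quick sanity check of the anchored pattern, the sketch below runs it in Python over a few sample paths. It assumes Python's re module agrees with RE2 on this pattern; the regexp_extract helper is a hypothetical stand-in, the empty string on no match is an assumption of the sketch, and all paths other than the one from the question are made up.

```python
import re

def regexp_extract(text: str, pattern: str, index: int) -> str:
    """Hypothetical Python stand-in for regexp_extract (not Impala's implementation)."""
    match = re.search(pattern, text)
    # Assumption of this sketch: no match yields an empty string.
    return match.group(index) if match else ""

ANCHORED = r"^/[^/]+/([^/]+)"

samples = ["/this/one/path/to/hdfs", "/data_2019/one/path", "/this/one", "/this"]
results = {p: regexp_extract(p, ANCHORED, 1) for p in samples}
for p, folder in results.items():
    print(f"{p!r} -> {folder!r}")

# For contrast, the pattern from the question with index 0 returns
# only the first run of slashes, not a folder name.
question_result = regexp_extract("/this/one/path/to/hdfs", r"[/]+", 0)
print(repr(question_result))
```

Because of the ^ anchor, the pattern can only match at the start of the string, so a path with a single folder yields no match rather than a folder taken from further right.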