Hive regexp_extract weirdness

Question

I am having some problems with regexp_extract.

I am querying a tab-delimited file; the column I'm checking has strings that look like this:

abc.def.ghi

Now, if I do:

select distinct regexp_extract(name, '[^.]+', 0) from dummy;

The MR job runs, it works, and I get "abc" from index 0.

But now, if I want to get "def" from index 1:

select distinct regexp_extract(name, '[^.]+', 1) from dummy;

Hive fails with:

2011-12-13 23:17:08,132 Stage-1 map = 0%,  reduce = 0%
2011-12-13 23:17:28,265 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201112071152_0071 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

The log file says:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row

Am I doing something fundamentally wrong here?

Thanks, Mario

Answer 1

From the docs (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF), it appears that regexp_extract() works per record/line, pulling out the part of the data that matches your pattern.

It seems to work on a first-match basis (find one match, then stop) rather than globally. The index therefore refers to a capture group, not to the n-th match:

0 = the entire match
1 = capture group 1
2 = capture group 2, etc.

Paraphrased from the manual:

regexp_extract('foothebar', 'foo(.*?)(bar)', 2)
                                  ^    ^   
               groups             1    2

This returns 'bar'.
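
To make the index semantics concrete, here is a minimal sketch in Python. It assumes Python's re module handles this pattern the same way as the Java-style engine Hive is presumed to use. The helper regexp_extract below is a hypothetical stand-in written for illustration, not Hive's implementation, and returning an empty string when nothing matches is an assumption.

```python
import re

def regexp_extract(subject, pattern, index):
    """Hypothetical stand-in for Hive's regexp_extract (illustration only)."""
    match = re.search(pattern, subject)  # first match only, not global
    if match is None:
        return ""  # assumption: no match yields an empty string
    return match.group(index)  # 0 = whole match, n = capture group n

print(regexp_extract("foothebar", r"foo(.*?)(bar)", 0))  # foothebar
print(regexp_extract("foothebar", r"foo(.*?)(bar)", 1))  # the
print(regexp_extract("foothebar", r"foo(.*?)(bar)", 2))  # bar
```

The index never selects a later match; it only selects which parenthesized part of the first match is returned.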

So, in your case, to get the text after the first dot, something like this might work:
regexp_extract(name, '\\.([^.]+)', 1)
or this:
regexp_extract(name, '[.]([^.]+)', 1)
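
As a quick check of the two patterns above, the following Python sketch runs them against the sample value. It assumes Python's re behaves like the Java-style engine for these patterns, and that the doubled backslash in the Hive literal is string escaping, so the regex itself is \.([^.]+).

```python
import re

name = "abc.def.ghi"

# Hive literal '\\.([^.]+)' is assumed to reach the engine as \.([^.]+)
escaped_dot = re.search(r"\.([^.]+)", name)
# The character-class form needs no backslash at all
bracket_dot = re.search(r"[.]([^.]+)", name)

print(escaped_dot.group(0))  # .def  (whole match includes the dot)
print(escaped_dot.group(1))  # def   (capture group 1 excludes it)
print(bracket_dot.group(1))  # def
```

Both forms stop at the first match, so they return the second segment only; they cannot reach "ghi".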

Edit

I got interested in this again. Just FYI, there could be a shortcut/workaround for you.

It looks like you want a particular segment of a string delimited by the dot (.) character, which is almost like split.
More than likely, the regex engine overwrites a capture group each time a quantifier repeats it, so the group ends up holding the text from the last iteration.
You can take advantage of that with something like this:

Returns the first segment ("abc" in abc.def.ghi):
regexp_extract(name, '^(?:([^.]+)\\.?){1}', 1)

Returns the second segment ("def" in abc.def.ghi):
regexp_extract(name, '^(?:([^.]+)\\.?){2}', 1)

Returns the third segment ("ghi" in abc.def.ghi):
regexp_extract(name, '^(?:([^.]+)\\.?){3}', 1)

The index doesn't change (because the index still refers to capture group 1); only the regex repetition count changes.
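
The repetition trick can be sketched in Python as follows. This assumes the engine keeps the last iteration's text in a repeated capture group, which Python's re does and which the answer supposes for Hive. The function name nth_segment is hypothetical and exists only for this illustration.

```python
import re

def nth_segment(name, n):
    """Return group 1 after the segment group has been repeated n times."""
    # Same regex as in the Hive calls above, with the repeat count filled in
    pattern = r"^(?:([^.]+)\.?){%d}" % n
    match = re.search(pattern, name)
    return match.group(1) if match else ""  # assumption: "" when no match

for n in (1, 2, 3):
    print(n, nth_segment("abc.def.ghi", n))
# 1 abc
# 2 def
# 3 ghi
```

Only the repeat count varies between calls; the group index passed to the extraction stays at 1.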

Some notes:

This regex, ^(?:([^.]+)\\.?){n}, has problems though.
It requires text between the dots, so an empty segment is never returned. Also, because the dot is optional, backtracking can split one segment in two when fewer than n segments are present (with {4}, abc.def.ghi still matches and group 1 holds "i").
It could be this instead, ^(?:([^.]*)\\.?){n}, but that will match even if there are fewer than n-1 dots,
including on the empty string. This is probably not desirable.
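
The weaknesses of both variants can be reproduced with a short Python sketch, again assuming Python's re backtracks the same way as the Java-style engine for these patterns. The helper last_group is hypothetical and returns None when there is no match.

```python
import re

def last_group(pattern, name):
    """Return capture group 1 of the first match, or None if no match."""
    match = re.search(pattern, name)
    return None if match is None else match.group(1)

plus_form = r"^(?:([^.]+)\.?){%d}"   # at least one char per segment
star_form = r"^(?:([^.]*)\.?){%d}"   # segments may be empty

# '+' form: a leading empty segment cannot be matched at all
print(repr(last_group(plus_form % 1, ".abc")))         # None
# '+' form: asking for a 4th segment splits "ghi" into "gh" and "i"
print(repr(last_group(plus_form % 4, "abc.def.ghi")))  # 'i'
# '*' form: matches with too few dots, and even on the empty string
print(repr(last_group(star_form % 3, "abc")))          # ''
print(repr(last_group(star_form % 2, "")))             # ''
```

In each case the result is either missing or misleading, which is the motivation for a stricter pattern.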

There is a way to do it that doesn't require text between the dots but still requires at least n-1 dots.
It uses a lookahead assertion, with capture buffer 2 as a flag:

^(?:(?!\\2)([^.]*)(?:\\.|$())){2}

Everything else is the same.

So, if it uses Java-style regex, then this should work:
regexp_extract(name, '^(?:(?!\\2)([^.]*)(?:\\.|$())){2}', 1)

Change {2} to whichever segment is needed (this one extracts segment 2). It still returns capture buffer 1 after the N-th iteration.

Here it is broken down:

^                # Beginning of string
 (?:             # Grouping
    (?!\2)            # Assertion: Capture buffer 2 is UNDEFINED
    ( [^.]*)          # Capture buffer 1, optional non-dot chars, many times
    (?:               # Grouping
        \.                # Dot character
      |                 # or,
        $ ()              # End of string, set capture buffer 2 DEFINED (prevents recursion when end of string)
    )                 # End grouping
 ){3}            # End grouping, repeat group exactly 3 (or N) times (overwrites capture buffer 1 each time)

If the engine doesn't support assertions, then this won't work!

Answer 2

I think you have to make 'groups', no?

select distinct regexp_extract(name, '([^.]+)', 1) from dummy;

(untested)

I think it behaves like the Java library and this should work, but let me know.
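
A small Python sketch shows what adding the parentheses changes. It assumes Python's re mirrors the Java library here: requesting a group that the pattern does not define raises an error, which is presumably what surfaces in Hive as the runtime error in the question. The helper has_group is hypothetical.

```python
import re

def has_group(match, index):
    """Report whether the match object can return the given group index."""
    try:
        match.group(index)
        return True
    except IndexError:  # raised when the pattern defines no such group
        return False

name = "abc.def.ghi"

# No parentheses in the pattern, so only group 0 exists
no_group = re.search(r"[^.]+", name)
print(no_group.group(0))       # abc
print(has_group(no_group, 1))  # False

# With parentheses, group 1 exists, but it is still the first match
with_group = re.search(r"([^.]+)", name)
print(with_group.group(1))     # abc
```

So the grouped pattern avoids the error but returns "abc" rather than "def"; reaching a later segment needs a pattern that consumes the earlier ones.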

Answer 1: 33, accepted, 2011-12-13 23:30:22
Answer 2: 1, 2011-12-13 22:28:25