regexp_extract arguments in Hive

Question

What does the argument in the curly brackets do in the following segment of code?

regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) Id,  
regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) Score,  
regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) DisplayName,

Answer 1

Curly brackets are a quantifier: they state how many times the preceding token, in this case a non-capturing group, may repeat.

The group contains a (possibly empty) capturing group made of non-comma characters, followed by an optional comma. Since there is only one number in the curly brackets, the non-capturing group must repeat exactly that number of times.
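A minimal sketch of that repetition, written in Python's `re` module rather than in Hive. It assumes Python's engine treats this particular pattern the same way Hive's does, and the sample row and the `last_capture` helper are hypothetical names used only for illustration. It shows that after exactly n repetitions, capturing group 1 holds what the last repetition matched.

```python
import re
from typing import Optional

# Hypothetical sample row; only the pattern shape comes from the question.
row = "a,b,c"

def last_capture(n: int, text: str) -> Optional[str]:
    """Return group 1 after the non-capturing group has repeated exactly n times."""
    # Builds e.g. ^(?:([^,]*),?){2} : n copies of "field, then optional comma".
    pattern = r"^(?:([^,]*),?){%d}" % n
    match = re.match(pattern, text)
    return match.group(1) if match else None

for n in (1, 2, 3):
    # Each repetition overwrites group 1, so only the n-th field survives.
    print(n, last_capture(n, row))
```

With this row, n = 1, 2 and 3 select the first, second and third field respectively.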

I don't know why the comma is escaped with a backslash. A comma is not a regex metacharacter, so the backslash seems unnecessary to me.
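A quick way to check this outside Hive is to compare the escaped and unescaped forms. The sketch below uses Python's `re` module, which is an assumption on my part: it says nothing about how Hive processes the backslash inside its own string literal before the regex engine sees it. The sample row is hypothetical.

```python
import re

# Hypothetical sample row.
row = "10,20,30"

# Same pattern as in the question, with and without the backslash before the comma.
escaped = re.match(r"^(?:([^,]*)\,?){2}", row)
unescaped = re.match(r"^(?:([^,]*),?){2}", row)

# Both forms match a literal comma, so both capture the second field.
print(escaped.group(1), unescaped.group(1))
```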

Caveat: I don't know Hadoop or Hive; all my knowledge of regexp_extract comes from this page.

The purpose of these regexes is to match the first, second and ninth element in a comma-separated list, where capturing group #1 (selected by the third argument of regexp_extract) returns only its last occurrence. Of course the comma is not really optional, except after the last element.
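The three calls from the question can be imitated as follows. This is a sketch in Python's `re` module, not Hive itself; the `extract` helper, the `COLUMNS` mapping and both sample rows are hypothetical. The second row has fewer than nine fields, to show what the optional comma permits in Python: the pattern still matches, but the later repetitions match nothing.

```python
import re

# Column name -> repetition count, mirroring the {1}, {2}, {9} in the question.
COLUMNS = {"Id": 1, "Score": 2, "DisplayName": 9}

def extract(col_value: str) -> dict:
    """Imitate the three regexp_extract calls on one comma-separated row."""
    result = {}
    for name, n in COLUMNS.items():
        match = re.match(r"^(?:([^,]*)\,?){%d}" % n, col_value)
        # Group 1 keeps only its last occurrence, i.e. the n-th field.
        result[name] = match.group(1) if match else None
    return result

full_row = "7,42,c,d,e,f,g,h,alice"  # hypothetical row with nine fields
short_row = "7,42"                   # hypothetical row with only two fields

print(extract(full_row))
print(extract(short_row))
```

For the short row, Python returns an empty string for DisplayName rather than failing to match, because both the field and the comma may be empty. Whether Hive behaves identically in that edge case is not something this sketch verifies.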
