
regexp_extract arguments in Hive

What does the argument in the curly brackets do in the following segment of code?

regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) Id,  
regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) Score,  
regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) DisplayName,

Curly brackets specify how many times the preceding token, in this case a non-capturing group, may repeat.

The group contains a (possibly empty) capturing group made of non-comma characters, followed by an optional comma. Since there is only one number in the curly brackets, the non-capturing group must repeat exactly that number of times.

I don't know why the comma is escaped with a backslash. A comma is not a regex metacharacter, so the backslash seems unnecessary to me.
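As a quick check of the point about the backslash, here is a minimal sketch in Python rather than Hive. It assumes that Python's `re` module treats an escaped comma the same way Hive's regex engine does, which I have not verified on Hive itself; the sample string is made up.

```python
import re

# Same pattern with and without the backslash before the comma.
escaped = re.compile(r'^(?:([^,]*)\,?){2}')
plain = re.compile(r'^(?:([^,]*),?){2}')

sample = "10,20,30"  # hypothetical row, not from the question

escaped_result = escaped.match(sample).group(1)
plain_result = plain.match(sample).group(1)

# Both patterns capture the same second field.
print(escaped_result, plain_result, escaped_result == plain_result)
```

If the two results ever differed on Hive, the likely cause would be how the string literal itself is parsed, not the regex.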

Caveat: I don't know Hadoop or Hive; all my knowledge of regexp_extract comes from this page.

The purpose of these regexes is to match the first, second and ninth elements of a comma-separated list. Capturing group #1 (selected by the third argument of regexp_extract) is overwritten on each repetition, so it returns only its last occurrence, which is the n-th element. Of course, the comma is not really optional, except after the last element.
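The behaviour described above can be sketched in Python, under the assumption that its `re` module handles a repeated capturing group the same way Hive's engine does (the group keeps only its last iteration). The helper name `extract_field` and the sample row are hypothetical, and the unnecessary backslash before the comma is dropped.

```python
import re

def extract_field(row: str, n: int) -> str:
    """Return the n-th (1-based) comma-separated field of row.

    Intended to mimic regexp_extract(row, '^(?:([^,]*),?){n}', 1):
    the non-capturing group repeats exactly n times, and capturing
    group 1 keeps only the text from its last repetition.
    """
    pattern = r'^(?:([^,]*),?){%d}' % n
    match = re.search(pattern, row)
    return match.group(1) if match else ""

# Hypothetical row with ten fields; the ninth plays the role of DisplayName.
row = "7,42,a,b,c,d,e,f,Alice,extra"

print(extract_field(row, 1))  # Id
print(extract_field(row, 2))  # Score
print(extract_field(row, 9))  # DisplayName
```

Because the inner group may be empty, an empty field such as the middle one in `1,,3` is returned as an empty string rather than skipped.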
