How to convert a string to an array in Hive?

The value of the column looks like this:

["a", "b", "c(d, e)"]

Here the value is of string type. I want to convert the string to an array, and I tried split(column_name, ','). However, because an element of the array contains a comma itself (e.g. "c(d, e)"), it did not work well. Is there any other way to convert the string to an array?
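To make the failure concrete, here is a minimal sketch of the naive split on the sample value. The aliases naive_array and piece_count are illustrative names, not part of the original question.

```sql
with your_data as (
  select '["a", "b", "c(d, e)"]' as str
)
select split(str, ',')       as naive_array,  -- also breaks "c(d, e)" at its inner comma
       size(split(str, ',')) as piece_count   -- 4 pieces instead of the 3 intended elements
  from your_data;
```

The brackets and quotes also stay attached to the pieces, so a plain comma split needs more than one fix.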

In this case you can split only on the commas that sit between double quotes.

The regexp '(?<="), *(?=")' matches a comma followed by optional spaces, but only when it sits between two double quotes. The quotes themselves are not part of the match.

(?<=") is a zero-width positive lookbehind: it asserts that the character immediately before the current position is ".

(?=") is a zero-width positive lookahead: it asserts that the character immediately after the current position is ".

After splitting this way, the array elements still carry their quotes, for example "a". To remove them, use regexp_replace:

Demo:

with your_data as (
  select '["a", "b", "c(d, e)"]' as str
) 

select split(str, '(?<="), *(?=")')       as splitted_array, 
       element, 
       regexp_replace(element,'^"|"$','') as element_unquotted
  from (
        select regexp_replace(str,'^\\[|\\]$','') as str --remove square brackets
         from your_data 
       ) d
       --explode array   
       lateral view explode(split(str, '(?<="), *(?=")')) e as element 

Result:

 splitted_array                       element      element_unquotted
 ["\"a\"","\"b\"","\"c(d, e)\""]       "a"          a
 ["\"a\"","\"b\"","\"c(d, e)\""]       "b"          b
 ["\"a\"","\"b\"","\"c(d, e)\""]       "c(d, e)"    c(d, e)

If you need an array of unquoted elements, you can assemble the array again using collect_list.
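A minimal sketch of that collect_list step, reusing the demo data above. The alias unquoted_array is an illustrative name. The query has no group by because the sample holds a single row; on a real table you would group by a row key (an assumed column, not shown in the question) to get one array per row.

```sql
with your_data as (
  select '["a", "b", "c(d, e)"]' as str
)
select collect_list(regexp_replace(element, '^"|"$', '')) as unquoted_array
  from (
        select regexp_replace(str, '^\\[|\\]$', '') as str --remove square brackets
          from your_data
       ) d
       --explode the array, then re-aggregate the unquoted elements
       lateral view explode(split(str, '(?<="), *(?=")')) e as element;
```

Note that collect_list does not formally guarantee element order, so if order matters it is worth checking on your data.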

Another way is to replace each ", " between elements with some delimiter, remove all remaining quotes and the square brackets, and then split on that delimiter.

Demo:

with your_data as (
  select '["a", "b", "c(d, e)"]' as str
) 
select split(str,  '\\|\\|\\|') splitted_array 
  from (--replace '", "' with |||, then remove the remaining quotes and the square brackets
         select regexp_replace(regexp_replace(str,'", *"','|||'),'^\\[|\\]$|"','') as str 
         from your_data ) d

Result:

splitted_array
["a","b","c(d, e)"]
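This second approach relies on the chosen delimiter never occurring inside the data. As a hedged sketch, a check like the following could be run first. The alias rows_containing_delimiter is illustrative, and your_data stands in for the real table.

```sql
with your_data as (
  select '["a", "b", "c(d, e)"]' as str
)
select count(*) as rows_containing_delimiter
  from your_data
 where instr(str, '|||') > 0;  --0 rows means ||| does not clash with the data
```

If the count is not zero, pick a different delimiter string before splitting.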
