Parsing and loading into Hive/Hadoop
I am new to the Hadoop MapReduce framework, and I am thinking of using Hadoop MapReduce to parse my data. I have thousands of big delimited files, and I am thinking of writing a MapReduce job to parse those files and load them into a Hive data warehouse. I have written a parser in Perl which can parse those files, but I am stuck on doing the same with Hadoop MapReduce.
For example, I have a file like: x=a y=b z=c ..... x=p y=q z=s ..... x=1 z=2 .... and so on.
Now I have to load this file as columns (x, y, z) in a Hive table, but I am not able to figure out how to proceed. Any guidance with this would be really helpful.
Another problem is that in some files the field y is missing, and I have to handle that condition in the MapReduce job. So far, I have tried using streaming.jar and giving my parser.pl as the mapper to that jar file. I don't think that is the way to do it :), but I was just trying to see if it would work. I also thought of using Hive's load function, but the missing column will create a problem if I specify a RegexSerDe for the Hive table.
I am lost in this now; if anyone could guide me with this I would be thankful :)
Regards, Atul
I posted something a while ago to my blog. (Google "hive parse_url"; it should be in the top few results.)
I was parsing URLs, but in this case you will want to use str_to_map.

str_to_map(arg1, arg2, arg3)

arg1 => string to process
arg2 => key-value pair separator
arg3 => key-value separator

str = "a=1 b=42 x=abc"
str_to_map(str, " ", "=")

The result of str_to_map will give you a map<string, string> of 3 key-value pairs.

str_to_map(str, " ", "=")["a"] --will return "1"
str_to_map(str, " ", "=")["b"] --will return "42"

Indexing the map with a key that is not present, such as "y" in your files where that field is missing, simply returns NULL, which is exactly what you want for the missing column.
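Hive's built-in does the real work here, but if it helps to see the semantics concretely, here is a tiny Python model of what str_to_map does (a sketch only, assuming space-separated key=value pairs; not Hive's actual implementation):

```python
def str_to_map(s, pair_sep=" ", kv_sep="="):
    """Rough Python model of Hive's str_to_map(): split the string into
    pairs on pair_sep, then split each pair into key and value on kv_sep."""
    result = {}
    for pair in s.split(pair_sep):
        if not pair:
            continue  # skip empty fragments from repeated separators
        key, _, value = pair.partition(kv_sep)
        result[key] = value
    return result

params = str_to_map("x=a z=c")  # note: y is missing, as in some of your files
print(params.get("x"))  # a
print(params.get("y"))  # None -- Hive likewise returns NULL for a missing key
```

The point of the last two lines is that a missing field does not break the parse; it just comes back as NULL (None here), so the missing-y files need no special casing.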
We can pass this to Hive via:
INSERT OVERWRITE TABLE new_table_with_cols_x_y_z
SELECT params["x"], params["y"], params["z"]
FROM (
  SELECT str_to_map(raw_line, " ", "=") AS params
  FROM data
) raw_line_from_data;
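If you still prefer the streaming route you mentioned, the mapper itself can be very small. A sketch in Python (assumptions: fields are space-separated key=value pairs, the target columns are x, y, z in that order, and a missing y should be emitted as Hive's default null marker \N; the script name parse_kv.py is made up):

```python
#!/usr/bin/env python
"""Hadoop Streaming mapper: turn 'x=a y=b z=c' lines into tab-separated rows."""
import sys

COLUMNS = ["x", "y", "z"]  # assumed target column order in the Hive table


def parse_line(line):
    """Parse one key=value line into a tab-separated row, \\N for missing fields."""
    fields = dict(p.split("=", 1) for p in line.split() if "=" in p)
    return "\t".join(fields.get(col, r"\N") for col in COLUMNS)


if __name__ == "__main__":
    for line in sys.stdin:
        line = line.strip()
        if line:
            print(parse_line(line))
```

You would then point an external Hive table (or LOAD DATA) at the job's tab-delimited output, and the \N cells load as NULL. That said, the pure-Hive str_to_map approach above avoids the extra MapReduce job entirely.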