

Pig cannot read its own intermediate data

First things first: according to the cluster, I'm running Apache Pig version 0.11.0-cdh4.3.0 (rexported). My build, however, uses 0.11.0-cdh4.5.0. I know that isn't a smart decision, but I don't think it is related to the issue I'm experiencing here, since both are Pig v0.11.0.

I have a script which structurally looks like this (both custom UDFs return the DataByteArray type, which is a valid Pig type as far as I know):

LOAD USING parquet.pig.ParquetLoader();

FOREACH GENERATE some of the fields

GROUP BY (a,b,c)

FOREACH GENERATE FLATTEN(group) AS (a,b,c), CustomUDF1(some_value) AS d

FOREACH GENERATE FLATTEN(CubeDimensions(a,b,c)) AS (a,b,c) , d

GROUP BY (a,b,c)

FOREACH GENERATE FLATTEN(group) AS (a,b,c), SUM(some_value), CustomUDF2(some_value)

STORE USING parquet.pig.ParquetStorer();

Pig splits this up into two MapReduce jobs. I'm not sure whether CubeDimensions happens in the first or the second, but I suspect it happens in the reduce stage of the first job.

So the mapping stage of the second job does nothing more than read the intermediate data, and that's where this happens:

"Unexpected data type 49 found in stream." @ org.apache.pig.data.BinInterSedes:422

I've seen the number be both 48 and 49, and neither exists in the BinInterSedes class:

http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.pig/pig/0.11.0-cdh4.3.0/org/apache/pig/data/BinInterSedes.java?av=f

But since this is Pig's own intermediate output, I don't quite get where it could have gone wrong. Both my custom UDFs return a valid type, and I would expect Pig to store data only using types it knows.

Any help would be greatly appreciated.

It appears that, by coincidence, the byte sequence Pig uses for record splitting in its intermediate storage also occurs in one of the byte arrays returned by the custom UDFs. This causes Pig to break up the record somewhere in the middle and start looking for a data type indicator there. Since that position is just in the middle of a record, there is no valid data type indicator at it, hence the error.
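The failure mode described above can be sketched with a toy tag-prefixed format. This is an illustrative assumption, not Pig's real wire format: the tag value below is made up, and the actual constants live in org.apache.pig.data.BinInterSedes. Notably, 48 and 49 happen to be the ASCII codes for '0' and '1', which is consistent with payload bytes being misread as type tags:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Toy sketch in the spirit of BinInterSedes; TAG_BYTEARRAY is a
// hypothetical constant, not Pig's real tag value.
public class DesyncDemo {
    static final byte TAG_BYTEARRAY = 1;

    // Serializes one datum as [tag][length][payload].
    static byte[] write(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeByte(TAG_BYTEARRAY);
        out.writeInt(payload.length);
        out.write(payload);
        return bos.toByteArray();
    }

    // Reads one datum starting at the given offset; throws if the byte
    // there is not a known tag, mirroring "Unexpected data type N".
    static byte[] read(byte[] stream, int offset) throws IOException {
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(stream, offset, stream.length - offset));
        byte tag = in.readByte();
        if (tag != TAG_BYTEARRAY) {
            throw new IOException("Unexpected data type " + tag + " found in stream.");
        }
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        return payload;
    }

    public static void main(String[] args) throws IOException {
        // Stream layout: [1][00 00 00 02]['0']['1']  (7 bytes total)
        byte[] stream = write("01".getBytes("US-ASCII"));
        read(stream, 0);     // fine: the reader starts on the tag byte
        try {
            read(stream, 5); // desynchronized: payload byte '0' (48) is read as a tag
        } catch (IOException e) {
            System.out.println(e.getMessage()); // Unexpected data type 48 found in stream.
        }
    }
}
```

If the reader loses track of record boundaries (here, by starting mid-payload), whatever byte it lands on is interpreted as a type tag, which is exactly the symptom above.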

I'm not entirely sure yet how I'm going to go about fixing this. @WinnieNicklaus already provided a good solution by splitting the script in two and storing in between. Another option would be to have the UDF return a Base64-encoded byte array. That way there can never be a conflict with Pig's intermediate storage, since it uses CTRL-A, CTRL-B, CTRL-C and a tuple indicator, none of which are alphanumeric characters.
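The Base64 option can be sketched as below. This is a minimal standalone illustration, not the actual UDF: in practice the encoded bytes would be wrapped in a DataByteArray inside an EvalFunc, which is omitted here to keep the example self-contained. The point is that Base64 output is drawn entirely from a printable alphabet (A-Z, a-z, 0-9, '+', '/', '='), so it can never contain the control bytes used as delimiters:

```java
import java.util.Base64;

public class Base64Workaround {

    // Encodes the UDF's raw result so every stored byte is printable ASCII.
    public static byte[] encodeSafe(byte[] raw) {
        return Base64.getEncoder().encode(raw);
    }

    public static void main(String[] args) {
        // Raw payload deliberately containing control bytes (0x01 = CTRL-A, ...)
        byte[] raw = {0x01, 0x02, 0x03, (byte) 0xFF};
        byte[] safe = encodeSafe(raw);
        for (byte b : safe) {
            // '+' (0x2B) is the lowest byte in the Base64 alphabet,
            // so no control character can ever appear in the output.
            if (b < 0x2B) throw new AssertionError("control byte leaked");
        }
        System.out.println(new String(safe)); // prints "AQID/w=="
    }
}
```

The trade-off is a ~33% size increase of the intermediate data and a decode step wherever the value is consumed, in exchange for output that is guaranteed delimiter-safe.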

