Apache Pig - calling Java UDF ToJSON multiple times in a script

(First post!)

I've been playing with an example resume dataset. The resume object is somewhat complex, with multiple sub-objects. For the current phase of my plan, I'm trying to flatten the dataset by storing the sub-objects as JSON strings. I'm running into a schema issue with the ToJSON UDF (https://github.com/rjurney/pig-to-json).

If I run the following statement in my Pig script, I get the right data in my fields, but it reuses the Positions field names for all the ToJson() calls:

stringifiedJSON =
FOREACH fullJSON
GENERATE
id ..  TotalYears,
com.hortonworks.pig.udf.ToJson(Awards) AS Awards:chararray,
com.hortonworks.pig.udf.ToJson(Certifications) AS Certifications:chararray,
CASE WHEN Degrees IS NULL THEN '[]' ELSE com.hortonworks.pig.udf.ToJson(Degrees) END AS Degrees:chararray,
com.hortonworks.pig.udf.ToJson(Links) AS Links:chararray,
com.hortonworks.pig.udf.ToJson(Groups) AS Groups:chararray,
com.hortonworks.pig.udf.ToJson(MilitaryService) AS MilitaryService:chararray,
com.hortonworks.pig.udf.ToJson(Positions) AS Positions:chararray;

If I describe the “fullJSON” dataset, here's what I get in return ("…" marks other fields not really relevant to the discussion):

fullJSON:
{
id: chararray,
..
Awards: {award: (AwardDate: chararray,AwardDescription: chararray,AwardTitle: chararray)},
Certifications: {certification: (CertDescription: chararray,CertEndDate: chararray,CertStartDate: chararray,CertTitle: chararray)},
…
Degrees: {(DegreeTitle: chararray,DegreeEndDate: chararray,DegreeStartDate: chararray,School: chararray,SchoolCity: chararray,SchoolState: chararray,DegreeEducationLevel: chararray)},
…
Links: {link: (LinkTitle: chararray,LinkURL: chararray)},
Groups: {group: (GroupDescription: chararray,GroupEndDate: chararray,GroupStartDate: chararray,GroupTitle: chararray)},
…
MilitaryService: {military_service: (MilitaryBranch: chararray,MilitaryCommendations: chararray,MilitaryCountry: chararray,MilitaryDescripton: chararray,MilitaryStartDate: chararray,MilitaryEndDate: chararray,MilitaryRank: chararray)},
…
Positions: {(Company: chararray,CompanyCity: chararray,CompanyState: chararray,JobStartDate: chararray,JobEndDate: chararray,JobTitle: chararray,IsCurrentTitle: int)},
…
}

Does anyone have any ideas? I tried splitting the ToJson() calls each into their own step, but I got the same results.

I later played with the source code of ToJSON.java a bit, and I think I've narrowed it down to the following bit of code. I added a log output of strSchema immediately after it, and it always returned the same information (that of the Positions field).

if (myProperties == null) {
    // Retrieve our class specific properties from UDFContext
    myProperties = UDFContext.getUDFContext().getUDFProperties(this.getClass());
}

String strSchema = myProperties.getProperty("horton.json.udf.schema");
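
A possible explanation for the log output, sketched under assumptions: if the properties are looked up by class alone, every ToJson() call in the script shares a single Properties object, so whichever call stores its schema last wins. The class below is a simplified stand-in, not Pig's UDFContext; the names SharedUdfPropertiesDemo, STORE, getUdfProperties and schemaSeenByExec are hypothetical, and only the property key comes from the snippet above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class SharedUdfPropertiesDemo {
    // Hypothetical stand-in for UDFContext: one Properties object per UDF class.
    static final Map<Class<?>, Properties> STORE = new HashMap<>();

    static Properties getUdfProperties(Class<?> udfClass) {
        return STORE.computeIfAbsent(udfClass, k -> new Properties());
    }

    // Hypothetical stand-in for the UDF; it only models the schema bookkeeping.
    public static class ToJson {
        // Stores the input schema so that it can be read back later.
        public void outputSchema(String inputSchema) {
            getUdfProperties(getClass()).setProperty("horton.json.udf.schema", inputSchema);
        }

        // Reads the schema back the way the snippet above does.
        public String schemaSeenByExec() {
            return getUdfProperties(getClass()).getProperty("horton.json.udf.schema");
        }
    }

    public static void main(String[] args) {
        ToJson awards = new ToJson();
        ToJson positions = new ToJson();
        awards.outputSchema("Awards schema");
        positions.outputSchema("Positions schema");
        // Both instances share one entry, so both see the schema stored last.
        System.out.println(awards.schemaSeenByExec());
        System.out.println(positions.schemaSeenByExec());
    }
}
```

In this model both instances read back the schema that was stored last, which would match strSchema always showing the Positions schema.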

Here's a sample of the stringifiedJSON output:

{
  "id":"http://something.com/some_guy",
  ...
  "Awards":"[]",
  "Certifications":"[]",
  "Degrees":"[{\"CompanyState\":null,\"CompanyCity\":null,\"JobEndDate\":\"\",\"IsCurrentTitle\":\"Bachelor's Degree\",\"JobTitle\":\"\",\"Company\":\"BS in Marketing\",\"JobStartDate\":\"State University\"}]",
  "Links":"[]",
  "Groups":"[]",
  "MilitaryService":"[]",
  "Positions":"[{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Scottsdale\",\"JobEndDate\":\"2010-03-01T00:00:00.000Z\",\"IsCurrentTitle\":0,\"JobTitle\":\"Job runner\",\"Company\":\"somecompany\",\"JobStartDate\":\"2005-06-01T00:00:00.000Z\"},{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Scottsdale\",\"JobEndDate\":\"2010-03-01T00:00:00.000Z\",\"IsCurrentTitle\":0,\"JobTitle\":\"Sales Rep\",\"Company\":\"Company2\",\"JobStartDate\":\"2005-06-01T00:00:00.000Z\"},{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Phoenix\",\"JobEndDate\":\"2004-12-01T00:00:00.000Z\",\"IsCurrentTitle\":0,\"JobTitle\":\"Job 3\",\"Company\":\"Company3\",\"JobStartDate\":\"1991-05-01T00:00:00.000Z\"},{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Phoenix\",\"JobEndDate\":\"2004-12-01T00:00:00.000Z\",\"IsCurrentTitle\":0,\"JobTitle\":\"CompanyRep\",\"Company\":\"Company4\",\"JobStartDate\":\"1991-05-01T00:00:00.000Z\"},{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Phoenix\",\"JobEndDate\":null,\"IsCurrentTitle\":null,\"JobTitle\":\"Job5\",\"Company\":\"Company5\",\"JobStartDate\":\"2014-09-01T00:00:00.000Z\"}]"
}
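
Reading the Degrees value against the two schemas from the describe output suggests that the Positions field names are being paired, position by position, with the values of the Degrees tuple. The sketch below reproduces that pairing in plain Java. The class and method names are hypothetical, and the null and empty-string values are taken from the sample output rather than from the real input data.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WrongSchemaPairingDemo {
    // Pairs field names with tuple values by position, the way a
    // schema-driven serializer might (an assumption about ToJson's behavior).
    public static Map<String, Object> pair(List<String> fieldNames, List<Object> values) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.size() && i < values.size(); i++) {
            out.put(fieldNames.get(i), values.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Field names of the Positions schema, in declared order.
        List<String> positionFields = Arrays.asList(
                "Company", "CompanyCity", "CompanyState", "JobStartDate",
                "JobEndDate", "JobTitle", "IsCurrentTitle");
        // A Degrees tuple in its own schema order: DegreeTitle, DegreeEndDate,
        // DegreeStartDate, School, SchoolCity, SchoolState, DegreeEducationLevel.
        List<Object> degreeTuple = Arrays.asList(
                "BS in Marketing", null, null, "State University",
                "", "", "Bachelor's Degree");
        System.out.println(pair(positionFields, degreeTuple));
    }
}
```

With this pairing, Company receives the degree title and IsCurrentTitle receives the education level, which is the same mix-up visible in the sample above.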

Here's what I wound up doing. I'd much prefer a different way to accomplish it, but it works. I would rather not have to make 7 different DEFINE calls at the beginning and instead just call the function itself and have it work.

I added a string called signature and a constructor to the class:

// Distinguishes this instance's UDF properties from those of other ToJson calls
private String signature = null;

public ToJson(String signature) {
    this.signature = signature;
}

I modified the outputSchema() of the class, adding the signature to the getUDFProperties() call:

Properties udfProp = context.getUDFProperties(this.getClass(),new String[]{signature});

And I similarly modified exec():

myProperties = UDFContext.getUDFContext().getUDFProperties(this.getClass(),new String[]{signature});
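
To illustrate why passing the signature helps, here is a minimal model in plain Java, assuming the properties are stored per combination of class and argument array. It is a stand-in, not Pig's implementation; SignatureKeyedPropertiesDemo, STORE, getUdfProperties and schemaSeenByExec are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class SignatureKeyedPropertiesDemo {
    // Hypothetical stand-in for UDFContext: one Properties object per
    // (class, argument array) combination instead of one per class.
    static final Map<String, Properties> STORE = new HashMap<>();

    static Properties getUdfProperties(Class<?> udfClass, String[] args) {
        String key = udfClass.getName() + Arrays.toString(args);
        return STORE.computeIfAbsent(key, k -> new Properties());
    }

    // Hypothetical stand-in for the modified UDF.
    public static class ToJson {
        private final String signature;

        public ToJson(String signature) {
            this.signature = signature;
        }

        // Stores the schema under this instance's own signature.
        public void outputSchema(String inputSchema) {
            getUdfProperties(getClass(), new String[]{signature})
                    .setProperty("horton.json.udf.schema", inputSchema);
        }

        // Reads the schema back using the same signature.
        public String schemaSeenByExec() {
            return getUdfProperties(getClass(), new String[]{signature})
                    .getProperty("horton.json.udf.schema");
        }
    }

    public static void main(String[] args) {
        ToJson awards = new ToJson("award");
        ToJson positions = new ToJson("position");
        awards.outputSchema("Awards schema");
        positions.outputSchema("Positions schema");
        // Each instance now reads back the schema it stored itself.
        System.out.println(awards.schemaSeenByExec());
        System.out.println(positions.schemaSeenByExec());
    }
}
```

Because each signature maps to its own entry, a later outputSchema() call no longer overwrites the schema stored by an earlier one.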

Then, in the Pig script itself, I added several DEFINE clauses:

DEFINE awardToJson com.hortonworks.pig.udf.ToJson('award');
DEFINE certToJson com.hortonworks.pig.udf.ToJson('cert');
DEFINE degreeToJson com.hortonworks.pig.udf.ToJson('degree');
DEFINE linkToJson com.hortonworks.pig.udf.ToJson('link');
DEFINE groupToJson com.hortonworks.pig.udf.ToJson('group');
DEFINE militaryToJson com.hortonworks.pig.udf.ToJson('military');
DEFINE positionToJson com.hortonworks.pig.udf.ToJson('position');

Then I adjusted the function calls in the Pig script:

stringifiedJSON =
  FOREACH fullJSON
  GENERATE
  id .. TotalYears,
  awardToJson(Awards) AS Awards:chararray,
  certToJson(Certifications) AS Certifications:chararray,
  CASE WHEN Degrees IS NULL THEN '[]' ELSE degreeToJson(Degrees) END AS Degrees:chararray,
  linkToJson(Links) AS Links:chararray,
  groupToJson(Groups) AS Groups:chararray,
  militaryToJson(MilitaryService) AS MilitaryService:chararray,
  positionToJson(Positions) AS Positions:chararray
  ;
