
How to programmatically read schema from header file in jaql?

I am trying to achieve the following in JAQL and am stuck.

I have two files: data.tsv, which contains tab-separated data, and header.tsv, which contains exactly one line of tab-separated values, corresponding to the "header" of data.tsv.

What I want to achieve is to read data.tsv using:

read(lines(location='data.tsv')) -> transform catch(delToJson($, {"schema": schema_json, "delimiter": "\t"}), {"errThresh":99999999999},$);

For this I need schema_json, a schema definition. I'd like to create this schema_json from the file header.tsv, assigning every field the type string.

Reading header.tsv is straightforward, and so is putting it into a record of type header_record = {"header1": string, "header2": string, ...}. However, how do I transform the jaql record header_record into an object of type schema, i.e. schema_json = schema {"header1": string, "header2": string, ...}?

OK, here is a very dirty workaround that nevertheless does the trick. I am still waiting for IBM support to get back to me with "the canonical way" (although I doubt this exists):

First, define the path of the header file:

HeaderFilePath = '/data/column_headers.tsv';

Then read the header file. The output is an array.

HeaderFile = localRead(del(location=HeaderFilePath, delimiter = "\t"));
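
To make the intermediate steps concrete, suppose (purely as an example) that header.tsv contains the three tab-separated column names header1, header2 and header3. Reading it this way should then give a nested array with a single row, which expand flattens into the plain list of column names, roughly:

HeaderFile;
>> [["header1", "header2", "header3"]]

HeaderFile -> expand;
>> ["header1", "header2", "header3"]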

Now I construct two arrays of the same length as the HeaderFile array, in order to use them with arrayToRecord in the next step. Why I construct two and not just one will become apparent later.

val_array = HeaderFile -> expand -> transform 'some string';
val_array2 = HeaderFile -> expand -> transform 'some other string';
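
Sticking with the hypothetical three-column header, each of the two arrays then holds one placeholder value per column, with different values in the two arrays:

val_array;
>> ["some string", "some string", "some string"]

val_array2;
>> ["some other string", "some other string", "some other string"]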

The idea is to build an artificial record schema_record with the same schema as the data, and then to get the schema via schemaof, which can then be used as the schema input for reading the data file. For this one can use:

schema_record = arrayToRecord(HeaderFile -> expand,val_array)
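
With the example header, this should produce a record whose field names come from header.tsv and whose values are all the same placeholder string, along the lines of:

schema_record;
>> {"header1": "some string", "header2": "some string", "header3": "some string"}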

Problems:

a) schemaof(schema_record) returns schema { * }?. This is because schemas can (seemingly) only be inferred from materialized data, i.e. one has to use schema_record := arrayToRecord(HeaderFile -> expand, val_array).

b) Now, using schemaof(schema_record) returns a schema, which is good. However, the schema record looks something like "header1": @{const: "some string", fixed: 11} string instead of the expected "header1": string (I don't understand why a schema function would do something like this). Hence this "schema" is pretty much useless. What is worse, there seems to be no way to manipulate that schema object so that one might be able to remove the @{} specifications.

Workaround: use the function elementsOf, which returns the schema of the elements of an array of schemas. Meaning:

elementsOf([schemaof({a:1,b:3}),{a:1,b:3}]); 
>> schema {"a":@{const: 1, fixed: 1} long, "b":@{const: 3, fixed: 1} long}.

However, using schemas with different "const" and "fixed" records will force elementsOf to fall back to a "raw" schema (without @{}):

elementsOf([schemaof({a:1,b:3}),{a:45,b:32}])
>> schema {"a": long, "b": long}.

This is the "dirty workaround" that I use to achieve what I want. (And all this is due to a very strange understanding of what a schema is...)

schema_array := [arrayToRecord(HeaderFile -> expand, val_array),arrayToRecord(HeaderFile -> expand, val_array2)];

DataSchema := elementsOf(schemaof(schema_array));

Data = read(lines(location='/data/data.tsv')) -> transform catch(delToJson($,
{"schema": DataSchema, "delimiter": "\t"}), {"errThresh": 99999999999},$);
