How to programmatically read schema from header file in JAQL?
I am trying to achieve the following in JAQL and am stuck.
I have two files: data.tsv, which contains tab-separated data, and header.tsv, which contains exactly one line of tab-separated values corresponding to the "header" of data.tsv.
What I want to achieve is to read data.tsv using:
read(lines(location='data.tsv')) -> transform catch(delToJson($, {"schema": schema_json, "delimiter": "\t"}), {"errThresh":99999999999},$);
For this I need schema_json, a schema definition. I'd like to create this schema_json from the file header.tsv (assigning every field the type "string").
Reading header.tsv is straightforward, and so is putting it into a record of the form header_record = {"header1": string, "header2":string, ....}. However, how do I transform the JAQL record header_record into an object of type schema: schema_json = schema {"header1":string,"header2":string, ....}?
OK, here is a very dirty workaround that nevertheless does the trick. I am still waiting for IBM support to get back to me with "the canonical way" (although I doubt this exists):
First, define the path of the header file:
HeaderFilePath = '/data/column_headers.tsv';
Then read the header file. The output is an array.
HeaderFile = localRead(del(location=HeaderFilePath, delimiter = "\t"));
Now I construct two arrays of the same length as the HeaderFile array, in order to use them with arrayToRecord in the next step. Why I construct two and not just one will become apparent later.
val_array = HeaderFile -> expand -> transform 'some string';
val_array2 = HeaderFile -> expand -> transform 'some other string';
The idea is to build an artificial record schema_record with the same schema as the data and then to get its schema via schemaof, which can then be used as the schema input for reading the data file. For this one can use
schema_record = arrayToRecord(HeaderFile -> expand,val_array)
Problems:
a) schemaof(schema_record) returns schema { * }?. This is because schemas can (seemingly) only be inferred from materialized data, i.e. one has to use schema_record := arrayToRecord(HeaderFile -> expand,val_array).
b) Now schemaof(schema_record) returns a schema, which is good. However, for reasons I don't understand, the schema record looks something like "header1": @{const: "some string", fixed: 11} string instead of the expected "header1": string. Hence this "schema" is pretty much useless. What is worse, there seems to be no way to manipulate that schema object so that one could remove the @{} specifications.
Workaround: use the function elementsOf, which, given the schema of an array, returns the schema of its elements. Meaning:
elementsOf([schemaof({a:1,b:3}),{a:1,b:3}]);
>> schema {"a":@{const: 1, fixed: 1} long, "b":@{const: 3, fixed: 1} long}.
However, using schemas with different "const" and "fixed" records will force elementsOf to fall back to a "raw" schema (without @{}):
elementsOf([schemaof({a:1,b:3}),{a:45,b:32}])
>> schema {"a": long, "b": long}.
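To make the reasoning behind the two dummy arrays a bit more concrete, here is a small toy model in Python (not JAQL, and not how JAQL actually implements schema inference; the names schema_of, unify and NO_CONST are made up for illustration). It assumes a field schema is just a type name plus an optional "const" annotation, and that merging element schemas keeps the annotation only when every element agrees on the value.

```python
# Toy model only: mimics the behaviour observed above, not JAQL internals.
NO_CONST = object()  # hypothetical marker meaning "no const annotation"

def schema_of(record):
    """Infer a per-field (type name, const value) pair from one record."""
    return {key: (type(value).__name__, value) for key, value in record.items()}

def unify(schemas):
    """Merge element schemas; drop 'const' as soon as two values differ."""
    merged = {}
    for schema in schemas:
        for key, (type_name, const) in schema.items():
            if key not in merged:
                merged[key] = (type_name, const)
                continue
            prev_type, prev_const = merged[key]
            same_type = type_name == prev_type
            keep_const = same_type and const == prev_const
            merged[key] = (
                type_name if same_type else "any",
                const if keep_const else NO_CONST,
            )
    return merged

# Identical records: the const annotation survives the merge.
same = unify([schema_of({"a": 1, "b": 3}), schema_of({"a": 1, "b": 3})])

# Different values: only the bare type is left, which is what we want.
diff = unify([schema_of({"a": 1, "b": 3}), schema_of({"a": 45, "b": 32})])
```

In this model, same still carries the constants 1 and 3, while diff keeps only the type names. That is the effect I rely on with val_array and val_array2: two records with different dummy strings per field leave nothing but the plain string type.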
This is the "dirty workaround" that I use to achieve what I want. (And all this is due to a very strange understanding of what a schema is...)
schema_array := [arrayToRecord(HeaderFile -> expand, val_array),arrayToRecord(HeaderFile -> expand, val_array2)];
DataSchema := elementsOf(schemaof(schema_array));
Data = read(lines(location='/data/data.tsv')) -> transform catch(delToJson($,
{"schema": DataSchema, "delimiter": "\t"}), {"errThresh": 99999999999},$);