
Spark Java: Escape dot in column names for vector assembler

I have a Dataset where some column names contain dots. The problem arises when it comes to VectorAssembler. It seems that they do not get along, so I tried to escape the dots in many ways, but nothing changed.

String[] expincols = newfilenameavgpeaks.columns();

VectorAssembler assemblerexp = new VectorAssembler()
                    .setInputCols(expincols)
                    .setOutputCol("intensity");

Dataset<Row> filenameoutput = assemblerexp.transform(newfilenameavgpeaks);

I have wrapped every element in expincols with "`", "``", "```", "````", "'", '"', etc., but nothing worked. I also tried these in the column names of newfilenameavgpeaks, still without success. Any ideas how to escape the dots?

If the dataset contains a column a.b, you can still use df.col("`a.b`") to select a column with a . in its name. This works because Dataset.col tries to resolve the column name and can handle the backticks.

VectorAssembler.transform, however, takes the schema of the supplied dataset and uses this StructType to handle the column names in VectorAssembler.transformSchema. The apply method of StructType simply does not contain the logic to handle the backticks and throws an IllegalArgumentException if the column names do not match exactly.
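The difference between the two lookups can be illustrated with a toy model in plain Java. This is not Spark's source code: the class LookupSketch and both of its methods are hypothetical, and they only mimic the behaviour described above (an exact-match lookup that throws, versus a lookup that first strips enclosing backticks).

```java
import java.util.Arrays;
import java.util.List;

public class LookupSketch {

    // Toy model of an exact-match lookup, as described for StructType.apply:
    // the requested name must equal a field name character for character.
    static int exactIndex(List<String> fieldNames, String name) {
        int i = fieldNames.indexOf(name);
        if (i < 0) {
            throw new IllegalArgumentException("Field \"" + name + "\" does not exist.");
        }
        return i;
    }

    // Toy model of a backtick-aware lookup: strip one pair of enclosing
    // backticks, then fall back to the exact-match lookup.
    static int quotedIndex(List<String> fieldNames, String name) {
        boolean quoted = name.length() >= 2 && name.startsWith("`") && name.endsWith("`");
        String plain = quoted ? name.substring(1, name.length() - 1) : name;
        return exactIndex(fieldNames, plain);
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("a.b", "c");

        // The backtick-aware lookup finds the dotted column.
        System.out.println("quoted lookup index: " + quotedIndex(fields, "`a.b`"));

        // The exact-match lookup sees the backticks as part of the name and fails.
        try {
            exactIndex(fields, "`a.b`");
        } catch (IllegalArgumentException e) {
            System.out.println("exact lookup failed: " + e.getMessage());
        }
    }
}
```

In this model, wrapping the name in backticks only helps when the lookup itself knows how to unwrap it, which is why quoting the entries of expincols has no effect on an exact-match lookup.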

Therefore the only option is to rename the columns before supplying them to the VectorAssembler:

Dataset<Row> newfilenameavgpeaks = ...

for( String col : newfilenameavgpeaks.columns()) {
    newfilenameavgpeaks = newfilenameavgpeaks
            .withColumnRenamed(col, col.replace('.', '_'));
}

VectorAssembler assemblerexp = new VectorAssembler()
    .setInputCols(newfilenameavgpeaks.columns()).setOutputCol("intensity");

Dataset<Row> filenameoutput = assemblerexp.transform(newfilenameavgpeaks);
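One thing the rename loop does not guard against is two columns ending up with the same name, for example a.b and a_b both becoming a_b. A minimal sketch of a pre-check in plain Java follows. The class ColumnNameSanitizer, the method renameMap, and the sample column names are hypothetical and not part of Spark.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ColumnNameSanitizer {

    // Hypothetical helper: maps every original column name to a dot-free
    // name and fails fast if two columns would collapse into the same name.
    static Map<String, String> renameMap(String[] columns) {
        Map<String, String> renames = new LinkedHashMap<>(); // keeps column order
        Set<String> seen = new HashSet<>();
        for (String col : columns) {
            String safe = col.replace('.', '_');
            if (!seen.add(safe)) {
                throw new IllegalArgumentException(
                        "Renaming '" + col + "' to '" + safe + "' collides with another column");
            }
            renames.put(col, safe);
        }
        return renames;
    }

    public static void main(String[] args) {
        // Sample names, assumed for illustration only.
        String[] columns = {"mz.100", "mz.101", "rt"};
        renameMap(columns).forEach((from, to) -> System.out.println(from + " -> " + to));
    }
}
```

Each entry of the returned map could then be applied with withColumnRenamed(from, to), exactly as in the loop above. Keeping the mapping also makes it possible to translate the sanitized names back to the original ones later.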


 