Spark Java: Escape dot in column names for vector assembler

I have a Dataset in which some column names contain dots. This causes a problem with VectorAssembler: the two do not get along, and none of the ways I tried to escape the dots made any difference.

String[] expincols = newfilenameavgpeaks.columns();

VectorAssembler assemblerexp = new VectorAssembler()
                    .setInputCols(expincols)
                    .setOutputCol("intensity");

Dataset<Row> filenameoutput = assemblerexp.transform(newfilenameavgpeaks);

I have tried wrapping every element of expincols in "`", "``", "```", "````", "'", '"', etc., but nothing worked. I also tried the same on the column names of newfilenameavgpeaks itself, again with no effect. Any ideas on how to escape the dots?

If the dataset contains a column named a.b, you can still use df.col("`a.b`") to select it, even though the name contains a dot. This works because Dataset.col resolves the column name and can handle the backticks.

VectorAssembler.transform, however, takes the schema of the supplied dataset and uses this StructType to look up the column names in VectorAssembler.transformSchema. The apply method of StructType has no logic for handling backticks and throws an IllegalArgumentException if a column name does not match exactly.

Therefore the only option is to rename the columns before supplying them to the VectorAssembler:

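import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> newfilenameavgpeaks = ...

// Replace every dot in every column name with an underscore.
for (String col : newfilenameavgpeaks.columns()) {
    newfilenameavgpeaks = newfilenameavgpeaks
            .withColumnRenamed(col, col.replace('.', '_'));
}

// The renamed columns can now be matched exactly against the schema.
VectorAssembler assemblerexp = new VectorAssembler()
        .setInputCols(newfilenameavgpeaks.columns())
        .setOutputCol("intensity");

Dataset<Row> filenameoutput = assemblerexp.transform(newfilenameavgpeaks);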
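
One thing to watch with the renaming loop above: replacing '.' with '_' can produce a name that already exists (for example, if the dataset has both a.b and a_b). The following is a minimal sketch in plain Java, with no Spark dependency, that computes the renames up front and fails early on such a collision. ColumnNameSanitizer and planRenames are hypothetical names made up for this illustration; they are not part of Spark.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Spark): plans dot-free column names. */
final class ColumnNameSanitizer {

    private ColumnNameSanitizer() {}

    /**
     * Maps each original column name to the same name with '.' replaced by '_',
     * preserving the input order. Throws IllegalArgumentException if two
     * columns would end up with the same cleaned name.
     */
    static Map<String, String> planRenames(String[] columns) {
        Map<String, String> renames = new LinkedHashMap<>();
        Set<String> taken = new LinkedHashSet<>();
        for (String original : columns) {
            String cleaned = original.replace('.', '_');
            // Set.add returns false when the cleaned name is already in use.
            if (!taken.add(cleaned)) {
                throw new IllegalArgumentException(
                        "Renaming '" + original + "' to '" + cleaned
                                + "' collides with another column");
            }
            renames.put(original, cleaned);
        }
        return renames;
    }
}
```

Assuming the dataset's columns() array is passed in, each entry of the returned map could then be fed to withColumnRenamed(key, value) in the same way as in the loop above.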
