Spark Java: Escape dot in column names for vector assembler

Question

I have a Dataset where some column names have dots. The problem arises when it comes to Vector Assembler. It seems that they do not get along, so I tried to escape the dots in many ways but nothing changed.

String[] expincols = newfilenameavgpeaks.columns();

VectorAssembler assemblerexp = new VectorAssembler()
                    .setInputCols(expincols)
                    .setOutputCol("intensity");

Dataset<Row> filenameoutput = assemblerexp.transform(newfilenameavgpeaks);

I have wrapped every element of expincols in "`", "``", "```", "````", "'", '"', etc., but nothing worked. I also tried the same in the column names of newfilenameavgpeaks, still without success. Any ideas how to escape the dots?

Answer 1

If the dataset contains a column a.b, you can still use df.col("`a.b`") to select it, even though its name contains a dot. This works because Dataset.col resolves the column name and understands backticks.

VectorAssembler.transform, however, takes the schema of the supplied dataset and uses this StructType to look up the column names in VectorAssembler.transformSchema. The apply method of StructType has no logic for handling backticks and throws an IllegalArgumentException if a column name does not match exactly.

Therefore the only option is to rename the columns before supplying them to the VectorAssembler:

Dataset<Row> newfilenameavgpeaks = ...

for( String col : newfilenameavgpeaks.columns()) {
    newfilenameavgpeaks = newfilenameavgpeaks
            .withColumnRenamed(col, col.replace('.', '_'));
}

VectorAssembler assemblerexp = new VectorAssembler()
    .setInputCols(newfilenameavgpeaks.columns()).setOutputCol("intensity");

Dataset<Row> filenameoutput = assemblerexp.transform(newfilenameavgpeaks);
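
One thing to watch with the rename approach: replacing '.' with '_' can make two different columns end up with the same name (for example a.b and a_b). The following is a minimal sketch of a check you could run on the column names before renaming. It uses plain Java only; the class ColumnNameSanitizer and the method buildRenames are hypothetical names made up for this illustration, not part of Spark.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Spark: plans the renames and detects clashes.
class ColumnNameSanitizer {

    // Returns a mapping original name -> name with every '.' replaced by '_',
    // in the same order as the input. Throws IllegalArgumentException if two
    // columns would end up with the same sanitized name.
    static Map<String, String> buildRenames(String[] columns) {
        Map<String, String> renames = new LinkedHashMap<>();
        Map<String, String> seen = new LinkedHashMap<>(); // sanitized -> original
        for (String original : columns) {
            String safe = original.replace('.', '_');
            String previous = seen.put(safe, original);
            if (previous != null) {
                throw new IllegalArgumentException("Columns '" + previous
                        + "' and '" + original + "' both map to '" + safe + "'");
            }
            renames.put(original, safe);
        }
        return renames;
    }

    public static void main(String[] args) {
        // Example column names, assumed for illustration only.
        String[] columns = {"mz.100", "mz.200", "label"};
        buildRenames(columns).forEach(
                (from, to) -> System.out.println(from + " -> " + to));
    }
}
```

Each entry of the returned map can then be passed to withColumnRenamed(from, to), in the same way as the loop above does. If a clash is reported, you would need to pick a different replacement character or rename the affected columns by hand.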
