
In Spark's Java API, how can I select columns from a Dataset<Row> using regular expressions?

Link to accepted answer - https://stackoverflow.com/a/56447083/8543652

Using Spark's Java API, I want to select a subset of columns from an existing Dataset using a regular expression and house them in a new Dataset.

For example, suppose I have a Dataset with a large number of Columns:

String[] columnNames = exampleDF.columns();

where

columnNames = {foo1,foo2,...,foon,bar1,bar2,...,bark}

Suppose from the above I wish to create a new Dataset from exampleDF containing only the foo columns.

So far, I have tried writing a regular-expression helper function in plain Java and passing its output to the Dataset's select method:

import java.util.Arrays;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

String[] filterColumns(Dataset<Row> inputDF, String regEx){

      // Get the column names, convert to a stream of strings
      Stream<String> columnStream = Arrays.stream(inputDF.columns());

      // Create a predicate from the desired regular expression
      Predicate<String> pred = Pattern.compile(regEx).asPredicate();

      // Filter the streamed names with the predicate, then convert to an array
      return columnStream.filter(pred).collect(Collectors.toList()).toArray(new String[0]);
}

Dataset<Row> outputDF = exampleDF.select(filterColumns(exampleDF, "foo."));  // does not compile

I understand that a varargs parameter can accept an array as input, and I was hoping the select method would accept said array. However, rather strangely, it appears that you must first pass a single String and only then a String vararg.
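For reference, these appear to be the two relevant select overloads in the Dataset Javadoc; note that neither takes only a String array:

// Relevant select overloads on Dataset:
Dataset<Row> select(String col, String... cols);   // first argument must be a single column name
Dataset<Row> select(Column... cols);               // accepts a Column array directly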

For example, if my helper function instead output an array such as:

String[] cumbersomeArray = {"foo2", "foo3", ..., "foon"};

I could input:

Dataset<Row> outputDF = exampleDF.select("foo1",filterColumns(outputDF, cumbersomeArray))

and that would work.

However, that is not very satisfactory, because I would then have to modify the helper function to output an awkward version of the array (everything except the first matching column), defeating the purpose of a helper function.

I have also tried the selectExpr method, but it seems only to accept SQL-like expressions.
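For example, it will happily accept explicit SQL expression strings such as the following (column names taken from the example above), but I see no obvious way to hand it a regular expression:

Dataset<Row> outputDF = exampleDF.selectExpr("foo1", "foo2 AS foo2_renamed");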

I am also aware of Dataset's colRegex method, but I couldn't find any examples or documentation (in fact, this is why I decided to try to implement my own helper function).

My questions are thus as follows:

1) Is it possible to change my helper function to output a String followed by a String[] such that I can place it straight into the Dataset's select method?

2) Alternatively, would my current helper function work as-is inside some other method?

3) Would colRegex or some other method I am unaware of help me here? If so, can you provide an example and documentation?

I would prefer to stick to native Java/Spark objects rather than rely on any third party libraries.

The Dataset select method you are using here is select(String col, String... cols), but you could use the select(Column... cols) overload instead by returning an array of Column from your helper function rather than an array of String.

Change the return type to Column[] and the return statement to:

return columnStream.filter(pred).map(x -> new Column(x)).collect(Collectors.toList()).toArray(new Column[0]);

Then you can just use the returned array like:

Dataset<Row> outputDF = exampleDF.select(filterColumns(exampleDF, "foo."));
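Putting it together, a full version of the modified helper might look like the sketch below (untested; imports included, and using Column's public constructor that takes a column name):

import java.util.Arrays;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Column[] filterColumns(Dataset<Row> inputDF, String regEx){

      // Stream the column names of the input Dataset
      Stream<String> columnStream = Arrays.stream(inputDF.columns());

      // Build a predicate from the desired regular expression
      Predicate<String> pred = Pattern.compile(regEx).asPredicate();

      // Keep the matching names, wrap each in a Column, and return as an array
      return columnStream.filter(pred)
                         .map(x -> new Column(x))
                         .collect(Collectors.toList())
                         .toArray(new Column[0]);
}

Dataset<Row> outputDF = exampleDF.select(filterColumns(exampleDF, "foo."));

As for your question 3: if you are on Spark 2.3 or later, Dataset's colRegex should also work here. As far as I can tell, it takes a backtick-quoted regular expression and expands to all matching columns, so something like the following ought to be equivalent (untested):

Dataset<Row> outputDF = exampleDF.select(exampleDF.colRegex("`foo.*`"));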
