I have large data records formatted as the following sample:
// +---+------+------+
// |cid|itemId|bought|
// +---+------+------+
// |abc| 123| true|
// |abc| 345| true|
// |abc| 567| true|
// |def| 123| true|
// |def| 345| true|
// |def| 567| true|
// |def| 789| false|
// +---+------+------+
cid and itemId are strings. There are 965,964,223 records.
I am trying to convert cid to an integer using StringIndexer as follows:
import org.apache.spark.ml.feature.StringIndexer

val repartitioned = dataset.repartition(50) // repartition returns a new Dataset; the result must be assigned
val cidIndexer = new StringIndexer().setInputCol("cid").setOutputCol("cidIndex")
val cidIndexedMatrix = cidIndexer.fit(repartitioned).transform(repartitioned)
But these lines of code are very slow (around 30 minutes), and the dataset is so huge that I cannot do anything further after that.
I am using an Amazon EMR cluster of 2 R4 2XLarge nodes (61 GB of memory each).
Is there any further performance improvement I can make? Any help will be much appreciated.
That is expected behavior if the cardinality of the column is high. As part of the training process, StringIndexer collects all the labels to create a label-to-index mapping (using Spark's org.apache.spark.util.collection.OpenHashMap).
This process requires O(N) memory in the worst case and is both computationally and memory intensive.
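Before fitting StringIndexer, it can be worth estimating how large that label map will be. A minimal sketch, assuming the dataset variable from the question and an active SparkSession:

```scala
// Estimate the cardinality of `cid` before committing to StringIndexer.
// approx_count_distinct uses HyperLogLog++, so this is cheap relative to
// an exact count over ~1 billion rows.
import org.apache.spark.sql.functions.approx_count_distinct

val distinctCids: Long = dataset
  .agg(approx_count_distinct("cid").as("distinct_cids"))
  .first()
  .getLong(0)

// If this number is in the tens of millions or more, the label map built
// by StringIndexer.fit will dominate driver memory and fitting time.
println(s"Approximate distinct cid values: $distinctCids")
```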
In cases where the cardinality of the column is high and its content is going to be used as a feature, it is better to apply FeatureHasher (Spark 2.3 or later):
import org.apache.spark.ml.feature.FeatureHasher
val hasher = new FeatureHasher()
  .setInputCols("cid")
  .setOutputCol("cid_hash_vec") // FeatureHasher has a single output column

hasher.transform(dataset)
It doesn't guarantee uniqueness and it is not reversible, but it is good enough for many applications and doesn't require a fitting step.
For a column that won't be used as a feature, you can also use the hash function:
import org.apache.spark.sql.functions.hash
import spark.implicits._ // for the $"..." column syntax

dataset.withColumn("cid_hash", hash($"cid"))
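One caveat: hash is 32-bit Murmur3, so at hundreds of millions of distinct keys collisions are effectively guaranteed. If that matters, a 64-bit hash reduces the risk substantially; a sketch assuming Spark 3.0+ (where xxhash64 was added) and the same dataset variable:

```scala
// xxhash64 returns a 64-bit hash, making collisions far less likely than
// with the 32-bit hash() at ~1 billion distinct values.
import org.apache.spark.sql.functions.{col, xxhash64}

dataset.withColumn("cid_hash64", xxhash64(col("cid")))
```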
Assuming that you want to use cid as a feature (after StringIndexer + OneHotEncoderEstimator), a few questions first: what is the cardinality of the cid column?
Without knowing much more, my first guess is that you should not worry about memory now and check your degree of parallelism first. You only have 2 R4 2XLarge instances, which give you 16 vCPUs and 122 GB of memory in total.
Personally, I would try to either increase the number of nodes, or replace the R4 2XLarge instances with instance types that have more CPUs. Unfortunately, with the current EMR offering this can only be achieved by throwing money at the problem.
Finally, what's the need for repartition(50)? That might just introduce further delays...
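Before forcing a repartition, it is worth checking how the data is already partitioned. A sketch, again assuming the dataset variable from the question:

```scala
// Inspect the current degree of parallelism before repartitioning.
val currentPartitions = dataset.rdd.getNumPartitions
println(s"Current partitions: $currentPartitions")

// With only 16 cores in total, 50 partitions is plausible, but if the
// data already has more partitions than that, repartition(50) triggers a
// full shuffle of ~1 billion rows while *reducing* parallelism.
```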