Spark StringIndexer.fit is very slow on large records
I have large data records formatted as the following sample:
// +---+------+------+
// |cid|itemId|bought|
// +---+------+------+
// |abc| 123| true|
// |abc| 345| true|
// |abc| 567| true|
// |def| 123| true|
// |def| 345| true|
// |def| 567| true|
// |def| 789| false|
// +---+------+------+
cid and itemId are strings. There are 965,964,223 records.
I am trying to convert cid to an integer using StringIndexer as follows:
dataset.repartition(50)
val cidIndexer = new StringIndexer().setInputCol("cid").setOutputCol("cidIndex")
val cidIndexedMatrix = cidIndexer.fit(dataset).transform(dataset)
But these lines of code are very slow (they take around 30 minutes). The problem is that the dataset is so huge that I could not do anything further after that.
I am using an Amazon EMR cluster of 2 R4 2XLarge nodes (61 GB of memory).
Is there any performance improvement I can make? Any help will be much appreciated.
That is expected behavior if the cardinality of the column is high. As part of the training process, StringIndexer collects all the labels to create a label-to-index mapping (using Spark's org.apache.spark.util.collection.OpenHashMap).
This process requires O(N) memory in the worst-case scenario, and is both computationally and memory intensive.
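Since both time and memory scale with the number of distinct labels, it can help to estimate the cardinality cheaply before committing to a fit. A minimal sketch, assuming an active SparkSession and the dataset above (the variable name is illustrative):

```scala
import org.apache.spark.sql.functions.approx_count_distinct

// Cheap, approximate count of the labels StringIndexer would have to collect.
// If this number is very large, fitting will be correspondingly expensive.
val approxCids = dataset.agg(approx_count_distinct("cid")).first.getLong(0)
```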
In cases where the cardinality of the column is high and its content is going to be used as a feature, it is better to apply FeatureHasher (Spark 2.3 or later).
import org.apache.spark.ml.feature.FeatureHasher

val hasher = new FeatureHasher()
  .setInputCols("cid")
  .setOutputCol("cid_hash_vec")

hasher.transform(dataset)
It doesn't guarantee uniqueness and it is not reversible, but it is good enough for many applications and doesn't require a fitting process.
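The collision rate can be traded against vector size via setNumFeatures (the default is 2^18 features). A sketch, where the chosen value is purely illustrative and not from the original answer:

```scala
import org.apache.spark.ml.feature.FeatureHasher

// A larger output dimensionality means fewer hash collisions, at the cost
// of wider (though still sparse) feature vectors. 1 << 20 is illustrative.
val hasherWide = new FeatureHasher()
  .setInputCols("cid")
  .setOutputCol("cid_hash_vec")
  .setNumFeatures(1 << 20)
```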
For a column that won't be used as a feature you can also use the hash function:
import org.apache.spark.sql.functions.hash
dataset.withColumn("cid_hash", hash($"cid"))
Assuming that you want to use cid as a feature (after StringIndexer + OneHotEncoderEstimator), a few questions first: how many distinct values are in the cid column?
Without knowing much more, my first guess is that you should not worry about memory now and check your degree of parallelism first. You only have 2 R4 2XLarge instances, which give you 8 vCPUs and 61 GB of memory each.
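A quick way to inspect the effective parallelism from the shell, assuming an active SparkSession named spark:

```scala
// Default task parallelism (typically the total executor cores available).
println(spark.sparkContext.defaultParallelism)
// Number of partitions used after shuffles (Spark's default is 200).
println(spark.conf.get("spark.sql.shuffle.partitions"))
```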
Personally, I would try to swap the R4 2XLarge instances for others that have more CPUs. Unfortunately, with the current EMR offering this can only be achieved by throwing money at the problem.
Finally, what's the need for repartition(50)? That might just introduce further delays...