Spark StringIndexer.fit is very slow on large records
I have large data records formatted as the following sample:
// +---+------+------+
// |cid|itemId|bought|
// +---+------+------+
// |abc| 123| true|
// |abc| 345| true|
// |abc| 567| true|
// |def| 123| true|
// |def| 345| true|
// |def| 567| true|
// |def| 789| false|
// +---+------+------+
cid and itemId are strings. There are 965,964,223 records.
I am trying to convert cid to an integer using StringIndexer as follows:
dataset.repartition(50)
val cidIndexer = new StringIndexer().setInputCol("cid").setOutputCol("cidIndex")
val cidIndexedMatrix = cidIndexer.fit(dataset).transform(dataset)
But these lines of code are very slow (they take around 30 minutes). The problem is that the dataset is so huge that I could not do anything further after that.
I am using an Amazon EMR cluster of 2 R4 2XLarge nodes (61 GB of memory).
Is there any performance improvement I can make? Any help will be much appreciated.
That is expected behavior if the cardinality of the column is high. As part of the training process, StringIndexer collects all the labels to create a label-to-index mapping (using Spark's org.apache.spark.util.collection.OpenHashMap).
This process requires O(N) memory in the worst-case scenario, and is both computationally and memory intensive.
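Since both time and memory scale with the number of distinct labels, it can help to estimate the cardinality cheaply before committing to a fit. A minimal sketch, assuming an active SparkSession and the dataset above (the variable name is illustrative):

```scala
import org.apache.spark.sql.functions.approx_count_distinct

// Cheap, approximate count of the labels StringIndexer would have to collect.
// If this number is very large, fitting will be correspondingly expensive.
val approxCids = dataset.agg(approx_count_distinct("cid")).first.getLong(0)
```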
In cases where the cardinality of the column is high and its content is going to be used as a feature, it is better to apply FeatureHasher (Spark 2.3 or later).
import org.apache.spark.ml.feature.FeatureHasher

val hasher = new FeatureHasher()
  .setInputCols("cid")
  .setOutputCol("cid_hash_vec")

hasher.transform(dataset)
It doesn't guarantee uniqueness and it is not reversible, but it is good enough for many applications and doesn't require a fitting process.
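The collision rate can be traded against vector size via setNumFeatures (the default is 2^18 features). A sketch, where the chosen value is purely illustrative and not from the original answer:

```scala
import org.apache.spark.ml.feature.FeatureHasher

// A larger output dimensionality means fewer hash collisions, at the cost
// of wider (though still sparse) feature vectors. 1 << 20 is illustrative.
val hasherWide = new FeatureHasher()
  .setInputCols("cid")
  .setOutputCol("cid_hash_vec")
  .setNumFeatures(1 << 20)
```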
For a column that won't be used as a feature you can also use the hash function:
import org.apache.spark.sql.functions.hash
dataset.withColumn("cid_hash", hash($"cid"))
Assuming that you want to use cid as a feature (after StringIndexer + OneHotEncoderEstimator), a few questions first: how many distinct values are in the cid column?
Without knowing much more, my first guess is that you should not worry about memory now and check your degree of parallelism first. You only have 2 R4 2XLarge instances, which give you 8 vCPUs and 61 GB of memory each.
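A quick way to inspect the effective parallelism from the shell, assuming an active SparkSession named spark:

```scala
// Default task parallelism (typically the total executor cores available).
println(spark.sparkContext.defaultParallelism)
// Number of partitions used after shuffles (Spark's default is 200).
println(spark.conf.get("spark.sql.shuffle.partitions"))
```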
Personally, I would try to swap the R4 2XLarge instances for others that have more CPUs. Unfortunately, with the current EMR offering this can only be achieved by throwing money at the problem.
Finally, what's the need for repartition(50)? That might just introduce further delays...