Koalas applymap moving all data to a single partition

Question

I need to apply an element-wise operation to a Koalas DataFrame, so I use the Koalas applymap method. At execution time, Koalas moves all the data to a single partition before applying the operation, and as a result the job performs very poorly.

>>> import databricks.koalas as ks
>>> from pyspark.sql import functions as F
>>> sdf = spark.range(0, 10**7, 1, 10).toDF('col1').withColumn('col2', F.lit('[1,2]'))
>>> kdf = ks.DataFrame(sdf)
>>> kdf_new = kdf[['col2']].applymap(eval)

WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

How can I make Koalas apply the operation within the existing partitions instead of shuffling the data?

Answer 1

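You can fix this by setting an index on the Koalas DataFrame. When a Koalas DataFrame is created from a Spark DataFrame without an explicit index, Koalas attaches a default index. The default type, 'sequence', is computed with a window function that has no partition specification, which is what moves all the data to a single partition and triggers the warning. See the section on default index types in the Koalas documentation.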

Either choose a different default index type:

ks.options.compute.default_index_type = 'distributed-sequence'

or set an explicit index (i.e. don't rely on the default one) on your DataFrame:

kdf = kdf.set_index('col1')
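
Putting the first option together with the example from the question, a minimal sketch could look like the following. It assumes Koalas (the databricks.koalas package) is installed and that an existing SparkSession is passed in as spark. The names parse_cell and apply_elementwise are hypothetical helpers introduced here for illustration, and ast.literal_eval is swapped in for the eval used in the question as a safer way to parse literals; neither is part of the original answer.

```python
import ast


def parse_cell(value):
    # Hypothetical helper: parse a string such as '[1,2]' into a Python
    # object. ast.literal_eval only accepts literals, unlike eval.
    return ast.literal_eval(value)


def apply_elementwise(spark):
    # Imports are deferred so parse_cell can be tried without Spark installed.
    import databricks.koalas as ks
    from pyspark.sql import functions as F

    # Assumption: the option should be set before the Koalas DataFrame is
    # created, because the default index is attached at creation time.
    ks.set_option('compute.default_index_type', 'distributed-sequence')

    sdf = (spark.range(0, 10**7, 1, 10)
                .toDF('col1')
                .withColumn('col2', F.lit('[1,2]')))
    kdf = ks.DataFrame(sdf)

    # Element-wise operation on the column from the question.
    return kdf[['col2']].applymap(parse_cell)
```

Called as apply_elementwise(spark) in a PySpark session, this should run the same job as in the question with the distributed default index in place of the 'sequence' one.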

1 answer

Answer 1 (accepted, 2021-05-26 23:41:14)