
Koalas applymap moving all data to a single partition

I need to perform an element-wise operation on a Koalas DataFrame, so I use the Koalas applymap method. During execution, Koalas moves all the data to a single partition and then applies the operation. As a result, the job performs very poorly.

>>> import databricks.koalas as ks
>>> from pyspark.sql import functions as F

>>> sdf = spark.range(0, 10**7, 1, 10).toDF('col1').withColumn('col2', F.lit('[1,2]'))

>>> kdf = ks.DataFrame(sdf)

>>> kdf_new = kdf[['col2']].applymap(eval)

WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

How can I force Koalas not to shuffle the data and to apply the operation within the existing partitions?

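You can fix this by specifying an index on the Koalas DataFrame. With the default index type, sequence, Koalas builds the index using a Window function without a partition specification, which is what triggers the warning and pulls all the data into a single partition. See the section on default index types in the Koalas documentation.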

Either specify a different default index type before creating the Koalas DataFrame:

ks.options.compute.default_index_type = 'distributed-sequence'

or set an explicit index (i.e. don't use the default one) on your DataFrame:

kdf = kdf.set_index('col1')
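Putting the two options together, a minimal sketch might look like the following. It assumes an active SparkSession is passed in as spark and that Koalas is installed. The helper names parse_list and applymap_without_default_index are made up for illustration, and the use of ast.literal_eval in place of the question's eval is an assumption of this sketch, not part of the original question. Either option alone should be enough to avoid the default sequence index.

```python
import ast


def parse_list(text):
    """Parse a string such as '[1,2]' into a Python list without calling eval."""
    return ast.literal_eval(text)


def applymap_without_default_index(spark):
    """Rebuild the DataFrame from the question and apply parse_list element-wise.

    Hypothetical helper: assumes `spark` is an active SparkSession.
    """
    # Imported lazily so parse_list can be tried without Spark installed.
    import databricks.koalas as ks
    from pyspark.sql import functions as F

    # Option 1: switch the default index type before any Koalas DataFrame exists.
    ks.set_option('compute.default_index_type', 'distributed-sequence')

    sdf = spark.range(0, 10**7, 1, 10).toDF('col1').withColumn('col2', F.lit('[1,2]'))

    # Option 2: use an existing column as the index, so no default index is attached.
    kdf = sdf.to_koalas(index_col='col1')

    # Element-wise operation on the string column.
    return kdf[['col2']].applymap(parse_list)
```

If the warning still shows up, it may be worth checking that the option was set before the Koalas DataFrame was created, since an index attached earlier is not rebuilt.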
