
Is there a way to set a minimum batch size for a pandas_udf in PySpark?

I am using a pandas_udf to apply a machine learning model on my Spark cluster, and I am interested in predefining the minimum number of records sent via Arrow to the UDF.

I followed the Databricks tutorial for the bulk of the UDF... https://docs.databricks.com/applications/deep-learning/inference/resnet-model-inference-tensorflow.html

From the tutorial, I set the Spark config to enable Arrow and cap the maximum batch size. I can easily set the maximum batch size, but is there a similar method for setting a minimum batch size that the UDF will handle?

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('App').getOrCreate()

# Enable Arrow-based transfer and cap the number of records per Arrow batch
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', PyArrowBatchSize)
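
For context, here is a minimal sketch of the kind of scalar pandas_udf this configuration feeds into; the model loading and column name below are simplified placeholders, not my actual code:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

# Stand-in for the real trained model; in practice this would deserialize
# the ML model once per executor, as in the Databricks tutorial.
def load_model():
    return lambda values: values * 2.0

@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def predict_udf(features):
    # 'features' is a pandas Series holding one Arrow batch
    # (up to maxRecordsPerBatch rows; the final batch may be smaller).
    model = load_model()
    return pd.Series(model(features.values))

# Hypothetical usage, assuming a numeric column named "feature":
# predictions = df.withColumn("prediction", predict_udf(df["feature"]))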

I am running Spark version 2.4.3 and Python 3.6.0.

There is no way to set a minimum batch size; Spark only exposes spark.sql.execution.arrow.maxRecordsPerBatch, and in this case "max" is a bit misleading. It really means something like "batch size before the remainder".

Ex: If you have 100132 rows in your dataset and your maxRecordsPerBatch is 10000, then you will get 10 batches of size 10000 and one batch of size 132 as the remainder. (If your data is split across multiple partitions, each partition produces its own remainder batch, so you may see several undersized batches depending on how the rows are split up.)

So the effective minimum batch size is simply whatever the remainder happens to be on each partition; every other batch will be exactly maxRecordsPerBatch rows.
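
If you want to verify this behavior yourself, here is a rough sketch (the column name, row count, and batch size are just illustrative) that uses a pandas_udf to tag each row with the size of the Arrow batch it arrived in:

from pyspark.sql import SparkSession
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 10000)

# 100132 rows in a single partition, so the split is easy to predict.
df = spark.range(100132).toDF("value").coalesce(1)

@pandas_udf(LongType(), PandasUDFType.SCALAR)
def batch_size(col):
    # Each call receives exactly one Arrow batch; report its length for every row.
    return pd.Series([len(col)] * len(col))

# Expect 100000 rows tagged 10000 and 132 rows tagged 132 (the remainder batch).
df.withColumn("batch_size", batch_size(df["value"])).groupBy("batch_size").count().show()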
