How to run a python user-defined function on the partitions of RDDs using mapPartitions?

Question

I'm trying to run a python UDF on the partitions of RDDs. Here is how I create rdd:

import numpy as np

# Read one float per line, skipping blank lines; sc is an existing SparkContext
with open("/home/zeinab/Desktop/inputFile.txt", "r") as text_file:
    linestofloat = np.array([float(l) for l in text_file if l.strip()])
data = sc.parallelize(linestofloat)

The format of the input text file looks like this:

26.000000
-8.000000
-28.000000
-6.000000
-18.000000
...

And the function I'm trying to run using mapPartitions is as follows:

def classic_sta_lta_py(a, nsta, nlta):
    """
    Computes the standard STA/LTA from a given input array a. The length of
    the STA is given by nsta in samples, respectively is the length of the
    LTA given by nlta in samples. Written in Python.

    .. note::

        There exists a faster version of this trigger wrapped in C
        called :func:`~obspy.signal.trigger.classic_sta_lta` in this module!

    :type a: NumPy :class:`~numpy.ndarray`
    :param a: Seismic Trace
    :type nsta: int
    :param nsta: Length of short time average window in samples
    :type nlta: int
    :param nlta: Length of long time average window in samples
    :rtype: NumPy :class:`~numpy.ndarray`
    :return: Characteristic function of classic STA/LTA
    """
    # The cumulative sum can be exploited to calculate a moving average (the
    # cumsum function is quite efficient)
    print("Hello!!!")
    #a =[x for x in floatelems.toLocalIterator()]
    #a = np.array(a)
    print("a array is: {} ".format(a))
    sta = np.cumsum(a ** 2)
    #print("{}. sta array is: ".format(sta))


    # Convert to float
    sta = np.require(sta, dtype=np.float64)

    # Copy for LTA
    lta = sta.copy()

    # Compute the STA and the LTA
    sta[nsta:] = sta[nsta:] - sta[:-nsta]
    sta /= nsta
    lta[nlta:] = lta[nlta:] - lta[:-nlta]
    lta /= nlta

    # Pad zeros
    sta[:nlta - 1] = 0

    # Avoid division by zero by setting zero values to tiny float
    dtiny = np.finfo(0.0).tiny
    idx = lta < dtiny
    lta[idx] = dtiny

    return sta / lta

But I keep getting the following error when I run the following line:

stalta_ratio = data.mapPartitions(lambda i: classic_sta_lta_py(i, 2, 30))

Error:

TypeError: unsupported operand type(s) for ** or pow(): 'itertools.chain' and 'int'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:939)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:939)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

Does anyone know what I am doing wrong?

Thank you.

Answer 1

The argument your lambda receives inside mapPartitions is an iterator over the partition's elements, but according to your function's documentation it expects a numpy.ndarray. You can convert it easily if your dataset is small enough to be handled by one executor. Try this:

data.mapPartitions(
    lambda i: classic_sta_lta_py(np.array(list(i)), 2, 30)
)
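
To see the conversion in isolation, here is a minimal sketch that runs without a Spark cluster. It feeds an itertools.chain (the same kind of object named in the error message) to a wrapper that builds the array first. The wrapper name sta_lta_partition is hypothetical, the STA/LTA function is a condensed copy of the one in the question, and nlta=3 is chosen only so the five sample values produce non-zero output.

```python
import itertools

import numpy as np


def classic_sta_lta(a, nsta, nlta):
    """Condensed copy of the STA/LTA function from the question."""
    sta = np.cumsum(a ** 2).astype(np.float64)
    lta = sta.copy()
    sta[nsta:] = sta[nsta:] - sta[:-nsta]
    sta /= nsta
    lta[nlta:] = lta[nlta:] - lta[:-nlta]
    lta /= nlta
    sta[:nlta - 1] = 0
    dtiny = np.finfo(0.0).tiny
    lta[lta < dtiny] = dtiny
    return sta / lta


def sta_lta_partition(iterator, nsta, nlta):
    """Hypothetical wrapper: materialize one partition, then compute STA/LTA."""
    # mapPartitions hands over an iterator, so build the ndarray here
    a = np.array(list(iterator), dtype=np.float64)
    # Return an iterator of plain Python floats, one per input element
    return iter(classic_sta_lta(a, nsta, nlta).tolist())


# Stand-in for the iterator Spark would pass for a single partition
partition = itertools.chain([26.0, -8.0], [-28.0, -6.0, -18.0])
result = list(sta_lta_partition(partition, 2, 3))
print(result)
```

With Spark, the same wrapper would be passed as data.mapPartitions(lambda it: sta_lta_partition(it, 2, 30)). Keep in mind that each partition is processed independently, so the STA and LTA windows never span a partition boundary and the result can differ from running the function over the whole array. If the data fits on one executor, sc.parallelize(linestofloat, 1) puts everything in a single partition.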

solution1 0 ACCEPTED 2018-07-13 19:47:08