How to run a python user-defined function on the partitions of RDDs using mapPartitions?

I'm trying to run a python UDF on the partitions of RDDs. Here is how I create the RDD:

import numpy as np

text_file = open("/home/zeinab/Desktop/inputFile.txt", "r")
lines = text_file.read().strip().split("\n")
linestofloat = []
for l in lines:
    linestofloat.append(float(l))
linestofloat = np.array(linestofloat)
data = sc.parallelize(linestofloat)  # sc is the existing SparkContext

The format of the input text file looks like this:

26.000000
-8.000000
-28.000000
-6.000000
-18.000000
...

And the function which I'm trying to run using mapPartitions is as follows:

def classic_sta_lta_py(a, nsta, nlta):
    """
    Computes the standard STA/LTA from a given input array a. The length of
    the STA is given by nsta in samples, respectively is the length of the
    LTA given by nlta in samples. Written in Python.

    .. note::

        There exists a faster version of this trigger wrapped in C
        called :func:`~obspy.signal.trigger.classic_sta_lta` in this module!

    :type a: NumPy :class:`~numpy.ndarray`
    :param a: Seismic Trace
    :type nsta: int
    :param nsta: Length of short time average window in samples
    :type nlta: int
    :param nlta: Length of long time average window in samples
    :rtype: NumPy :class:`~numpy.ndarray`
    :return: Characteristic function of classic STA/LTA
    """
    # The cumulative sum can be exploited to calculate a moving average (the
    # cumsum function is quite efficient)
    print("Hello!!!")
    #a =[x for x in floatelems.toLocalIterator()]
    #a = np.array(a)
    print("a array is: {} ".format(a))
    sta = np.cumsum(a ** 2)
    #print("{}. sta array is: ".format(sta))


    # Convert to float
    sta = np.require(sta, dtype=np.float)

    # Copy for LTA
    lta = sta.copy()

    # Compute the STA and the LTA
    sta[nsta:] = sta[nsta:] - sta[:-nsta]
    sta /= nsta
    lta[nlta:] = lta[nlta:] - lta[:-nlta]
    lta /= nlta

    # Pad zeros
    sta[:nlta - 1] = 0

    # Avoid division by zero by setting zero values to tiny float
    dtiny = np.finfo(0.0).tiny
    idx = lta < dtiny
    lta[idx] = dtiny

    return sta / lta
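For context, the function expects a plain NumPy array rather than an iterator. A minimal direct (non-Spark) call would look like the sketch below; the sample values and the window lengths nsta=2, nlta=4 are only illustrative:

import numpy as np

# Illustrative only: a tiny synthetic trace and short windows
a = np.array([26.0, -8.0, -28.0, -6.0, -18.0, 4.0, 12.0, -3.0])
cft = classic_sta_lta_py(a, nsta=2, nlta=4)
print(cft)  # characteristic function, same length as a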

But I keep getting the following error when I run this line:

stalta_ratio = data.mapPartitions(lambda i: classic_sta_lta_py(i, 2, 30))

Error:

TypeError: unsupported operand type(s) for ** or pow(): 'itertools.chain' and 'int'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:939)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:939)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

Does anyone know what I am doing wrong?

Thank you.

The type of the parameter your lambda receives inside mapPartitions is an iterator, but judging from your function's documentation you need a numpy.ndarray there. You can convert it easily if your dataset is small enough to be handled by one executor. Try this one:

data.mapPartitions(
    lambda i: classic_sta_lta_py(np.array(list(i)), 2, 30)
)
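Putting it together, a minimal end-to-end sketch under the same assumptions (convert each partition's iterator to a NumPy array, keep the window lengths 2 and 30 from the question) could be:

stalta_ratio = data.mapPartitions(
    lambda i: classic_sta_lta_py(np.array(list(i)), 2, 30)
)

# The ndarray returned for each partition is iterated back into the RDD;
# collect() is only for inspecting the result on small data.
print(stalta_ratio.collect())

Note that each partition is converted and processed independently, so the STA/LTA is computed per partition rather than over the full trace.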
