
Using Python class methods on an RDD

My question may sound similar to this and this, but those solutions did not get me unstuck either.
I have a class named Tokenizer, defined as -

class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        """
        Argument: s -- any string or unicode object
        Value: a tokenized list of strings; concatenating this list returns the original string if preserve_case=False
        """
        # Try to ensure unicode:
        try:
            s = str(s)
        except UnicodeDecodeError:
            s = s.encode('string_escape')
            s = str(s)
        # Fix HTML character entities:
        s = self.__html2unicode(s)
        # Tokenize:
        words = word_re.findall(s)
        # Possibly alter the case, but avoid changing emoticons like :D into :d:
        if not self.preserve_case:
            words = map((lambda x: x if emoticon_re.search(x) else x.lower()), words)
        return words

tok = Tokenizer(preserve_case=False)
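
The class above also relies on two module-level regular expressions, word_re and emoticon_re, which are defined elsewhere and are not shown in the question. Purely as an illustrative stand-in (not the original patterns), something like the following makes the snippet self-contained:

import re

# Illustrative stand-ins only; the real patterns are defined elsewhere in the original tokenizer.
emoticon_re = re.compile(r"[<>]?[:;=8][\-o\*']?[\)\]\(\[dDpP/\:\}\{@\|\\]")
word_re = re.compile(r"\S+")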

I have an RDD of (key, value) pairs of the form (user_id, tweets). I want to apply the tokenize method of the Tokenizer class to the tweets in the RDD. What I did is -

rdd.foreach(lambda x:tok.tokenize(x[1])).take(5)  

and got the error -

'NoneType' object has no attribute 'take'

I also tried

rdd1.map(lambda x:tok.tokenize(x[1])).take(5)  

and got the error -

Py4JJavaError                             Traceback (most recent call last)
----> 1 rdd1.map(lambda x: tok.tokenize(x[1])).take(5)

~/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py in take(self, num)
   1358
   1359                 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1360                 res = self.context.runJob(self, takeUpToNumLeft, p)
   1361
   1362                 items += res

~/anaconda3/lib/python3.6/site-packages/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
   1067         # SparkContext#runJob.
   1068         mappedRDD = rdd.mapPartitions(partitionFunc)
-> 1069         sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
   1070         return list(_load_from_socket(sock_info, mappedRDD._jrdd_deserializer))
   1071

~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258
   1259         for temp_arg in temp_args:

~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 (TID 101, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 397, in dump_stream
    bytes = self.serializer.dumps(vs)
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 576, in dumps
    return pickle.dumps(obj, protocol)
AttributeError: Can't pickle local object 'Tokenizer.tokenize..'

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$runJob$3.apply(PythonRDD.scala:153)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:153)
	at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  (same Python worker traceback as above, ending in)
AttributeError: Can't pickle local object 'Tokenizer.tokenize..'
	... 1 more

Any help would be appreciated. Thanks in advance!

rdd.foreach(lambda x:tok.tokenize(x[1])).take(5)

Here you are trying to use the result of rdd.foreach(), which is None: foreach is an action that runs only for its side effects and returns nothing, so there is no RDD left to call .take(5) on.
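
As a small illustration (not part of the original answer), foreach always returns None, which is exactly why chaining another call onto it fails:

result = rdd.foreach(lambda x: print(x))   # runs on the executors purely for side effects
print(result)                              # None, so result.take(5) raises the error above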

rdd1.map(lambda x:tok.tokenize(x[1])).take(5)

Here you are calling a method of a custom object from inside a lambda, which triggers the next exception:

AttributeError: Can't pickle local object 'Tokenizer.tokenize..'

This essentially means that PySpark cannot serialize the Tokenizer.tokenize method. One possible solution is to call tok.tokenize(x[1]) from a plain function and pass a reference to that function to map, like this:

def tokenize(x):
    return tok.tokenize(x[1])   # x is a (user_id, tweet) pair, so the tweet is x[1]

rdd1.map(tokenize).take(5)
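
A follow-up note (a sketch under assumptions, not part of the original answer): if shipping the driver-side tok instance is still a problem, you can construct the Tokenizer inside the function that runs on the executors, for example once per partition with mapPartitions. The class definition itself still has to be picklable or importable on the executors.

def tokenize_partition(records):
    # Build the tokenizer on the executor so the driver-side instance never has to be pickled.
    local_tok = Tokenizer(preserve_case=False)
    for user_id, tweet in records:
        yield user_id, local_tok.tokenize(tweet)

rdd1.mapPartitions(tokenize_partition).take(5)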

There is also another problem in your code: self.__html2unicode(s) calls a method that is never defined in the class. That will lead to the following error:

AttributeError: 'Tokenizer' object has no attribute '_Tokenizer__html2unicode'
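
If the rest of the original tokenizer (which presumably defines that helper) is not available, a minimal stand-in that only unescapes HTML character entities could look like the method below; this is an illustrative sketch using the standard html module, not the original implementation:

import html  # Python 3 standard library

class Tokenizer:
    # ... __init__ and tokenize as shown above ...

    def __html2unicode(self, s):
        # Minimal stand-in: turn HTML character entities such as &amp; or &#39; back into characters.
        return html.unescape(s)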

Related topics

PySpark: PicklingError: Could not serialize object: TypeError: can't pickle CompiledFFI objects

https://github.com/yahoo/TensorFlowOnSpark/issues/198
