How to define a file filter for file name patterns in Apache Spark Streaming in Java?
I'm using Apache Spark Streaming 1.2.0 and trying to define a file filter for file names when creating an InputDStream by invoking the fileStream method. My code works fine when I don't use a file filter, e.g. by invoking the other fileStream overload.
According to the documentation of the fileStream method, I can pass it a filter of this type:
scala.Function1<org.apache.hadoop.fs.Path,Object> filter
But so far I have not been able to create a working fileFilter. My initial attempts were:
1- Tried to implement it as:
Function1<Path, Object> fileFilter = new Function1<Path, Object>() {
    @Override
    public Object apply(Path v1) {
        return true;
    }

    @Override
    public <A> Function1<A, Object> compose(Function1<A, Path> g) {
        return Function1$class.compose(this, g);
    }

    @Override
    public <A> Function1<Path, A> andThen(Function1<Object, A> g) {
        return Function1$class.andThen(this, g);
    }
};
But apparently my implementation of andThen is wrong, and I don't understand how I should implement it. The compiler complains that the anonymous class
is not abstract and does not override abstract method <A>andThen$mcVJ$sp(scala.Function1<scala.runtime.BoxedUnit,A>) in scala.Function1
2- Tried to implement it as:
Function1<Path, Object> fileFilter = new AbstractFunction1<Path, Object>() {
    @Override
    public Object apply(Path v1) {
        return true;
    }
};
This one compiles, but when I run it I get an exception:
2015-02-02 13:42:50 ERROR OneForOneStrategy:66 - myModule$1
java.io.NotSerializableException: myModule$1
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
at org.apache.spark.streaming.DStreamGraph$$anonfun$writeObject$1.apply$mcV$sp(DStreamGraph.scala:169)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:985)
at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:184)
at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:263)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:167)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:76)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1.aroundReceive(JobGenerator.scala:74)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Any ideas how I can implement a fileFilter and pass it to the fileStream method, so that Spark Streaming processes only the files whose names match the patterns I want?
I had to create another file named FileFilter.java:
import java.io.Serializable;

import org.apache.hadoop.fs.Path;
import scala.runtime.AbstractFunction1;

// A named, Serializable filter: Spark Streaming serializes it along with
// the DStream graph when it writes a checkpoint.
public class FileFilter extends AbstractFunction1<Path, Object> implements Serializable {
    @Override
    public Object apply(Path v1) {
        // Accept only paths ending in ".json"; the boolean is autoboxed to Boolean.
        return v1.toString().endsWith(".json");
    }
}
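A likely reason this works where the anonymous AbstractFunction1 subclass did not: scala.runtime.AbstractFunction1 does not itself implement Serializable, and the stack trace in the question shows the filter being written out with the DStream graph during checkpointing. The sketch below reproduces the difference with the JDK alone. BaseFilter, NamedJsonFilter and SerializationSketch are hypothetical stand-ins invented for illustration, not Spark or Scala classes, and String is used in place of Path.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationSketch {
    // Returns true if Java serialization accepts the object, false if it
    // rejects it with NotSerializableException.
    static boolean isJavaSerializable(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Mirrors the second attempt in the question: an anonymous subclass of a
    // base class that is not Serializable.
    static BaseFilter anonymousFilter() {
        return new BaseFilter() {
            @Override
            public Object apply(String path) {
                return true;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println("anonymous: " + isJavaSerializable(anonymousFilter()));
        System.out.println("named:     " + isJavaSerializable(new NamedJsonFilter()));
    }
}

// Hypothetical stand-in for AbstractFunction1<Path, Object>: an abstract base
// class that does NOT implement Serializable.
abstract class BaseFilter {
    public abstract Object apply(String path);
}

// Mirrors FileFilter: a named subclass that opts in to serialization.
class NamedJsonFilter extends BaseFilter implements Serializable {
    private static final long serialVersionUID = 1L;

    @Override
    public Object apply(String path) {
        return path.endsWith(".json");
    }
}
```

Under these assumptions the anonymous instance is rejected and the named one is accepted. The deciding factor in this sketch is implementing Serializable, not the separate source file as such.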
Then pass an instance of FileFilter to the fileStream method, as in:
fileStream(inDirectory, new FileFilter(), false, ...)
It worked without any problems.
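Since the question asks about file name patterns rather than a single suffix, the matching logic could be generalised with a regular expression. The following is only a sketch of that logic in plain Java: FileNamePattern, its accepts method and the example pattern are hypothetical names chosen for illustration. In the Spark job, the same check would presumably live inside apply(Path) of a Serializable filter class, called with the result of Path.getName().

```java
import java.io.Serializable;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark or Hadoop.
class FileNamePattern implements Serializable {
    private static final long serialVersionUID = 1L;

    // java.util.regex.Pattern is itself Serializable, so it can be kept as a field.
    private final Pattern pattern;

    FileNamePattern(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // True only when the whole file name matches the pattern.
    boolean accepts(String fileName) {
        return pattern.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        // Assumed naming scheme: "events-" + eight digits + ".json".
        FileNamePattern events = new FileNamePattern("events-\\d{8}\\.json");
        System.out.println(events.accepts("events-20150202.json"));     // true
        System.out.println(events.accepts("events-20150202.json.tmp")); // false
    }
}
```

Matching against the whole name, rather than using endsWith, also makes it straightforward to exclude names such as temporary files that merely contain the expected suffix.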