Apache Spark Streaming Custom Receiver (Text File) using Java
I'm new to Apache Spark. I need to read log files from a local/mounted directory; an external source writes the files into that directory. For example, the external source writes logs into combined_file.txt, and once writing is complete it creates a new file with the prefix 0_, e.g. 0_combined_file.txt. I then need to read the combined_file.txt log file and process it. So I'm trying to write a custom receiver that checks whether a log file in the local/mounted directory has finished writing, and then reads the completed file.
Here is my code:
@Override
public void onStart() {
    Runnable th = () -> {
        while (true) {
            try {
                Thread.sleep(1000L);
                File dir = new File("/home/PK01/Desktop/arcflash/");
                File[] completedFiles = dir.listFiles((dirName, fileName) -> {
                    return fileName.toLowerCase().startsWith("0_");
                });
                // metaDataFile --> 0_test.txt (marker file)
                // dataFile     --> test.txt  (completed data file)
                for (File metaDataFile : completedFiles) {
                    String compFileName = metaDataFile.getName();
                    compFileName = compFileName.substring(2);
                    File dataFile = new File("/home/PK01/Desktop/arcflash/" + compFileName);
                    if (dataFile.exists()) {
                        FileInputStream fis = new FileInputStream(dataFile);
                        byte[] data = new byte[(int) dataFile.length()];
                        fis.read(data);
                        fis.close();
                        store(new String(data));
                        dataFile.delete();
                        metaDataFile.delete();
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    };
    new Thread(th).start();
}
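As an aside, the marker-file check can be factored into a small pure helper, which makes it easy to unit-test independently of the receiver and the file system (the class and method names here are illustrative, not from the original code):

```java
import java.util.ArrayList;
import java.util.List;

public class CompletedFileScanner {
    // Given the file names present in the watch directory, return the names of
    // data files whose "0_"-prefixed marker file exists, i.e. files that the
    // external source has finished writing.
    public static List<String> completedDataFiles(List<String> fileNames) {
        List<String> result = new ArrayList<>();
        for (String name : fileNames) {
            if (name.toLowerCase().startsWith("0_")) {
                String dataName = name.substring(2);   // strip the "0_" prefix
                if (fileNames.contains(dataName)) {    // data file must still exist
                    result.add(dataName);
                }
            }
        }
        return result;
    }
}
```

The receiver loop would then only deal with I/O (reading, storing, deleting), while the completion logic stays testable on its own.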
I'm trying to process the data as below:
JavaReceiverInputDStream<String> data = jssc.receiverStream(receiver);
data.foreachRDD(fileStreamRdd -> {
    processOnSingleFile(fileStreamRdd.flatMap(streamBatchData -> {
        return Arrays.asList(streamBatchData.split("\\n")).iterator();
    }));
});
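The flatMap above turns each stored file payload back into individual log lines. In plain Java, the split behaves like this (a minimal illustration, independent of Spark):

```java
import java.util.Arrays;
import java.util.List;

public class LineSplitter {
    // Mirrors the flatMap logic: one stored payload -> individual log lines.
    public static List<String> toLines(String payload) {
        return Arrays.asList(payload.split("\\n"));
    }
}
```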
But I'm getting the exception below:
18/01/19 12:08:39 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
18/01/19 12:08:39 WARN BlockManager: Block input-0-1516343919400 replicated to only 0 peer(s) instead of 1 peers
18/01/19 12:08:40 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.OutOfMemoryError: Java heap space
at com.esotericsoftware.kryo.io.Output.<init>(Output.java:60)
at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:91)
at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:308)
at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:308)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:312)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/01/19 12:08:40 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 1,5,main]
java.lang.OutOfMemoryError: Java heap space
at com.esotericsoftware.kryo.io.Output.<init>(Output.java:60)
at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:91)
at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:308)
at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:308)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:312)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/01/19 12:08:40 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
at com.esotericsoftware.kryo.io.Output.<init>(Output.java:60)
at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:91)
at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:308)
at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:308)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:312)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Can anybody help me resolve the error here? Any help will be appreciated.
18/01/19 12:08:40 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 1,5,main] java.lang.OutOfMemoryError: Java heap space
The above shows that you are hitting an out-of-memory error. Increase the memory explicitly when submitting the Spark job.
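For example, memory can be raised with the standard `spark-submit` flags; the class name, jar, and sizes below are placeholders to be tuned to your data volume and cluster:

```shell
spark-submit \
  --class com.example.LogStreamingJob \
  --master local[2] \
  --driver-memory 4g \
  --executor-memory 4g \
  your-streaming-job.jar
```

Equivalently, `spark.driver.memory` and `spark.executor.memory` can be set via `--conf`. Note also that `store(new String(data))` pushes each whole file into a single block, so files approaching the heap size will still fail; storing the file line by line would keep individual blocks small.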