Gobblin map-reduce job running successfully on EMR but no output in S3
I am running Gobblin to move data from Kafka to S3 using a 3-node EMR cluster. I am running Hadoop 2.6.0, and I also built Gobblin against 2.6.0.
The map-reduce job seems to run successfully. On my HDFS I see the metrics and working directories; metrics has some files, but the working directory is empty. The S3 bucket should have the final output but contains no data. And at the end it says:
Output task state path /gooblinOutput/working/GobblinKafkaQuickStart_mapR3/output/job_GobblinKafkaQuickStart_mapR3_1460132596498 does not exist Deleted working directory /gooblinOutput/working/GobblinKafkaQuickStart_mapR3
Here are the final logs:
2016-04-08 16:23:26 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1366 - Job job_1460065322409_0002 running in uber mode : false
2016-04-08 16:23:26 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1373 - map 0% reduce 0%
2016-04-08 16:23:32 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1373 - map 10% reduce 0%
2016-04-08 16:23:33 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1373 - map 40% reduce 0%
2016-04-08 16:23:34 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1373 - map 60% reduce 0%
2016-04-08 16:23:36 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1373 - map 80% reduce 0%
2016-04-08 16:23:37 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1373 - map 100% reduce 0%
2016-04-08 16:23:38 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1384 - Job job_1460065322409_0002 completed successfully
2016-04-08 16:23:38 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1391 - Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=1276095
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=28184
HDFS: Number of bytes written=41960
HDFS: Number of read operations=60
HDFS: Number of large read operations=0
HDFS: Number of write operations=11
Job Counters
Launched map tasks=10
Other local map tasks=10
Total time spent by all maps in occupied slots (ms)=1828125
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=40625
Total vcore-seconds taken by all map tasks=40625
Total megabyte-seconds taken by all map tasks=58500000
Map-Reduce Framework
Map input records=10
Map output records=0
Input split bytes=2150
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=296
CPU time spent (ms)=10900
Physical memory (bytes) snapshot=2715054080
Virtual memory (bytes) snapshot=18852671488
Total committed heap usage (bytes)=4729077760
File Input Format Counters
Bytes Read=6444
File Output Format Counters
Bytes Written=0
2016-04-08 16:23:38 UTC INFO [TaskStateCollectorService STOPPING] gobblin.runtime.TaskStateCollectorService 101 - Stopping the TaskStateCollectorService
2016-04-08 16:23:38 UTC WARN [TaskStateCollectorService STOPPING] gobblin.runtime.TaskStateCollectorService 123 - Output task state path /gooblinOutput/working/GobblinKafkaQuickStart_mapR3/output/job_GobblinKafkaQuickStart_mapR3_1460132596498 does not exist
2016-04-08 16:23:38 UTC INFO [main] gobblin.runtime.mapreduce.MRJobLauncher 443 - Deleted working directory /gooblinOutput/working/GobblinKafkaQuickStart_mapR3
2016-04-08 16:23:38 UTC INFO [main] gobblin.util.ExecutorsUtils 125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@6c257d54[Shutting down, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-04-08 16:23:38 UTC INFO [main] gobblin.util.ExecutorsUtils 144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@6c257d54[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-04-08 16:23:38 UTC INFO [main] gobblin.runtime.app.ServiceBasedAppLauncher 158 - Shutting down the application
2016-04-08 16:23:38 UTC INFO [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils 125 - Attempting to shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@5584dbb6
2016-04-08 16:23:38 UTC INFO [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils 144 - Successfully shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@5584dbb6
2016-04-08 16:23:38 UTC WARN [Thread-7] gobblin.runtime.app.ServiceBasedAppLauncher 153 - ApplicationLauncher has already stopped
2016-04-08 16:23:38 UTC WARN [Thread-4] gobblin.metrics.reporter.ContextAwareReporter 116 - Reporter MetricReportReporter has already been stopped.
2016-04-08 16:23:38 UTC WARN [Thread-4] gobblin.metrics.reporter.ContextAwareReporter 116 - Reporter MetricReportReporter has already been stopped.
Here are my conf files:
FILE 1: gobblin-mapreduce.properties
# Thread pool settings for the task executor
taskexecutor.threadpool.size=2
taskretry.threadpool.coresize=1
taskretry.threadpool.maxsize=2
# File system URIs
fs.uri=hdfs://{host}:8020
writer.fs.uri=${fs.uri}
state.store.fs.uri=s3a://{bucket}/gobblin-mapr/
# Writer related configuration properties
writer.destination.type=HDFS
writer.output.format=AVRO
writer.staging.dir=${env:GOBBLIN_WORK_DIR}/task-staging
writer.output.dir=${env:GOBBLIN_WORK_DIR}/task-output
# Data publisher related configuration properties
data.publisher.type=gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=${env:GOBBLIN_WORK_DIR}/job-output
data.publisher.replace.final.dir=false
# Directory where job/task state files are stored
state.store.dir=${env:GOBBLIN_WORK_DIR}/state-store
# Directory where error files from the quality checkers are stored
qualitychecker.row.err.file=${env:GOBBLIN_WORK_DIR}/err
# Directory where job locks are stored
job.lock.dir=${env:GOBBLIN_WORK_DIR}/locks
# Directory where metrics log files are stored
metrics.log.dir=${env:GOBBLIN_WORK_DIR}/metrics
# Interval of task state reporting in milliseconds
task.status.reportintervalinms=5000
# MapReduce properties
mr.job.root.dir=${env:GOBBLIN_WORK_DIR}/working
# s3 bucket configuration
data.publisher.fs.uri=s3a://{bucket}/gobblin-mapr/
fs.s3a.access.key={key}
fs.s3a.secret.key={key}
FILE 2: kafka-to-s3.pull
job.name=GobblinKafkaQuickStart_mapR3
job.group=GobblinKafka_mapR3
job.description=Gobblin quick start job for Kafka
job.lock.enabled=false
kafka.brokers={kafka-host}:9092
topic.whitelist={topic_name}
source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka
writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt
data.publisher.type=gobblin.publisher.BaseDataPublisher
mr.job.max.mappers=10
bootstrap.with.offset=latest
metrics.reporting.file.enabled=true
metrics.enabled=true
metrics.reporting.file.suffix=txt
Running commands:
export GOBBLIN_WORK_DIR=/gooblinOutput
Command : bin/gobblin-mapreduce.sh --conf /home/hadoop/gobblin-files/gobblin-dist/kafkaConf/kafka-to-s3.pull --logdir /home/hadoop/gobblin-files/gobblin-dist/logs
Not sure what's going on. Can someone please help?
There were 2 issues:
1. I had data.publisher.final.dir=${env:GOBBLIN_WORK_DIR}/job-output. It should have been something like s3a://dev.com/gobblin-mapr6/ so the publisher writes to S3 instead of the HDFS work directory.
2. Somehow special characters got added to topic.whitelist, so Gobblin was not able to recognize the topic.
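With those two fixes, the relevant lines of gobblin-mapreduce.properties and kafka-to-s3.pull would look something like the sketch below. The bucket name and path (s3a://dev.com/gobblin-mapr6/) are just the example from above, and {topic_name} is still a placeholder for your actual topic:

```properties
# gobblin-mapreduce.properties: publish final output to S3,
# not to the HDFS work dir
data.publisher.fs.uri=s3a://dev.com/gobblin-mapr6/
data.publisher.final.dir=s3a://dev.com/gobblin-mapr6/

# kafka-to-s3.pull: retype the whitelist by hand so no invisible
# characters sneak in from copy-paste
topic.whitelist={topic_name}
```

If the whitelist was pasted from elsewhere, it can help to delete the line and retype it, since a stray non-printing character makes the topic name fail to match.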