简体   繁体   English

Gobblin Map-reduce 作业在 EMR 上成功运行,但在 s3 中没有输出

[英]Gobblin Map-reduce job running successfully on EMR but no output in s3

I am running gobblin to move data from kafka to s3 using 3 node EMR cluster.我正在运行 gobblin 以使用 3 节点 EMR 集群将数据从 kafka 移动到 s3。 I am running on hadoop 2.6.0 and I also built gobblin against 2.6.0.我在 hadoop 2.6.0 上运行,我还针对 2.6.0 构建了 gobblin。

It seems like map-reduce job runs successfully.似乎 map-reduce 作业运行成功。 On my hdfs i see metrics and working directory.在我的 hdfs 上,我看到了指标和工作目录。 metrics has some files but working directory is empty.指标有一些文件,但工作目录是空的。 S3 bucket should have had final output but has no data. S3 存储桶应该有最终输出但没有数据。 And at the end it says最后它说

Output task state path /gooblinOutput/working/GobblinKafkaQuickStart_mapR3/output/job_GobblinKafkaQuickStart_mapR3_1460132596498 does not exist Deleted working directory /gooblinOutput/working/GobblinKafkaQuickStart_mapR3输出任务状态路径/gooblinOutput/working/GobblinKafkaQuickStart_mapR3/output/job_GobblinKafkaQuickStart_mapR3_1460132596498不存在删除工作目录/gooblinOutput/working/GobblinKafkaQuickStart_mapR3

Here are final logs :以下是最终日志:

2016-04-08 16:23:26 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1366 -      Job job_1460065322409_0002 running in uber mode : false
2016-04-08 16:23:26 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 0% reduce 0%
2016-04-08 16:23:32 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 10% reduce 0%
2016-04-08 16:23:33 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 40% reduce 0%
2016-04-08 16:23:34 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 60% reduce 0%
2016-04-08 16:23:36 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 80% reduce 0%
2016-04-08 16:23:37 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 100% reduce 0%
2016-04-08 16:23:38 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1384 -      Job job_1460065322409_0002 completed successfully
2016-04-08 16:23:38 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1391 -      Counters: 30
    File System Counters
     FILE: Number of bytes read=0
    FILE: Number of bytes written=1276095
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=28184
    HDFS: Number of bytes written=41960
    HDFS: Number of read operations=60
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=11
Job Counters
    Launched map tasks=10
    Other local map tasks=10
    Total time spent by all maps in occupied slots (ms)=1828125
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=40625
    Total vcore-seconds taken by all map tasks=40625
    Total megabyte-seconds taken by all map tasks=58500000
Map-Reduce Framework
    Map input records=10
    Map output records=0
    Input split bytes=2150
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=296
    CPU time spent (ms)=10900
    Physical memory (bytes) snapshot=2715054080
    Virtual memory (bytes) snapshot=18852671488
    Total committed heap usage (bytes)=4729077760
File Input Format Counters
    Bytes Read=6444
File Output Format Counters
    Bytes Written=0
2016-04-08 16:23:38 UTC INFO  [TaskStateCollectorService STOPPING]    gobblin.runtime.TaskStateCollectorService  101 - Stopping the    TaskStateCollectorService
2016-04-08 16:23:38 UTC WARN  [TaskStateCollectorService STOPPING] gobblin.runtime.TaskStateCollectorService  123 - Output task state path /gooblinOutput/working/GobblinKafkaQuickStart_mapR3/output/job_GobblinKafkaQuickStart_mapR3_1460132596498 does not exist
2016-04-08 16:23:38 UTC INFO  [main] gobblin.runtime.mapreduce.MRJobLauncher  443 - Deleted working directory /gooblinOutput/working/GobblinKafkaQuickStart_mapR3
2016-04-08 16:23:38 UTC INFO  [main] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@6c257d54[Shutting down, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
 2016-04-08 16:23:38 UTC INFO  [main] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@6c257d54[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]  
 2016-04-08 16:23:38 UTC INFO  [main] gobblin.runtime.app.ServiceBasedAppLauncher  158 - Shutting down the application
 2016-04-08 16:23:38 UTC INFO  [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@5584dbb6
 2016-04-08 16:23:38 UTC INFO  [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@5584dbb6
 2016-04-08 16:23:38 UTC WARN  [Thread-7] gobblin.runtime.app.ServiceBasedAppLauncher  153 - ApplicationLauncher has already stopped
 2016-04-08 16:23:38 UTC WARN  [Thread-4] gobblin.metrics.reporter.ContextAwareReporter  116 - Reporter MetricReportReporter has already been stopped.
 2016-04-08 16:23:38 UTC WARN  [Thread-4] gobblin.metrics.reporter.ContextAwareReporter  116 - Reporter MetricReportReporter has already been stopped.

Here are my conf files :这是我的 conf 文件:

gobblin-mapreduce.properties

# Thread pool settings for the task executor
taskexecutor.threadpool.size=2
taskretry.threadpool.coresize=1
taskretry.threadpool.maxsize=2

# File system URIs
fs.uri=hdfs://{host}:8020
writer.fs.uri=${fs.uri}
state.store.fs.uri=s3a://{bucket}/gobblin-mapr/

# Writer related configuration properties
writer.destination.type=HDFS
writer.output.format=AVRO
writer.staging.dir=${env:GOBBLIN_WORK_DIR}/task-staging
writer.output.dir=${env:GOBBLIN_WORK_DIR}/task-output

# Data publisher related configuration properties 
data.publisher.type=gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=${env:GOBBLIN_WORK_DIR}/job-output
data.publisher.replace.final.dir=false

# Directory where job/task state files are stored
state.store.dir=${env:GOBBLIN_WORK_DIR}/state-store

# Directory where error files from the quality checkers are stored
qualitychecker.row.err.file=${env:GOBBLIN_WORK_DIR}/err

# Directory where job locks are stored
job.lock.dir=${env:GOBBLIN_WORK_DIR}/locks

# Directory where metrics log files are stored
metrics.log.dir=${env:GOBBLIN_WORK_DIR}/metrics

# Interval of task state reporting in milliseconds
task.status.reportintervalinms=5000

# MapReduce properties
mr.job.root.dir=${env:GOBBLIN_WORK_DIR}/working


# s3 bucket configuration

data.publisher.fs.uri=s3a://{bucket}/gobblin-mapr/
fs.s3a.access.key={key}
fs.s3a.secret.key={key}

FILE 2 : kafka-to-s3.pull文件 2:kafka-to-s3.pull

job.name=GobblinKafkaQuickStart_mapR3
job.group=GobblinKafka_mapR3
job.description=Gobblin quick start job for Kafka
job.lock.enabled=false

kafka.brokers={kafka-host}:9092
topic.whitelist={topic_name}

source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka

writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt

data.publisher.type=gobblin.publisher.BaseDataPublisher

mr.job.max.mappers=10
bootstrap.with.offset=latest

metrics.reporting.file.enabled=true
metircs.enabled=true
metrics.reporting.file.suffix=txt

Running commands运行命令

export GOBBLIN_WORK_DIR=/gooblinOutput
Command : bin/gobblin-mapreduce.sh --conf /home/hadoop/gobblin-files/gobblin-dist/kafkaConf/kafka-to-s3.pull --logdir /home/hadoop/gobblin-files/gobblin-dist/logs

Not sure whats going on.不知道发生了什么。 Can someone please help?有人可以帮忙吗?

There were 2 issues有 2 个问题

I had data.publisher.final.dir=${env:GOBBLIN_WORK_DIR}/job-output我有 data.publisher.final.dir=${env:GOBBLIN_WORK_DIR}/job-output

It should have been something like s3a://dev.com/gobblin-mapr6/它应该类似于 s3a://dev.com/gobblin-mapr6/

and then I somehow got special characters added to topic.whitelist.然后我以某种方式将特殊字符添加到了 topic.whitelist。 So, it was not able to recognize the topic所以,它无法识别主题

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM