
YARN Giraph application on Google Cloud - fat jar not found

I'm trying to run my Giraph-based application on a Hadoop cluster through YARN. The command I use is:

yarn jar solver-1.0-SNAPSHOT.jar edu.agh.iga.adi.giraph.IgaSolverTool

First I need to copy that JAR to one of the directories reported by yarn classpath. Just to be safe, I also changed the file permissions to 777.

I obviously need to ship that JAR to the workers, so I do:

conf.setYarnLibJars(currentJar());

where currentJar() is:

  private static String currentJar() {
    return new File(IgaGiraphJobFactory.class.getProtectionDomain()
        .getCodeSource()
        .getLocation()
        .getPath()).getName();
  }
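Note that getName() strips the directory part, so Giraph receives only the bare file name rather than an absolute path. A minimal, self-contained illustration of that behavior (the local path here is hypothetical):

```java
import java.io.File;

public class JarNameDemo {
    public static void main(String[] args) {
        // Hypothetical local path to the fat jar; getName() drops the
        // directories and keeps only the file name that Giraph receives.
        String path = "/home/kbhit/solver-1.0-SNAPSHOT.jar";
        System.out.println(new File(path).getName());
        // prints: solver-1.0-SNAPSHOT.jar
    }
}
```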

This uses the JAR name, which seems to be fine, since the application no longer crashes quickly (it would if anything else were used). Instead, it runs for around 10 minutes, after which a failure is reported. There is an error in the logs:

LogType:gam-stderr.log
LogLastModifiedTime:Sat Sep 14 13:24:52 +0000 2019
LogLength:2122
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/nm-local-dir/usercache/kbhit/appcache/application_1568451681492_0016/filecache/11/solver-1.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "pool-6-thread-2" java.lang.IllegalStateException: Could not configure the containerlaunch context for GiraphYarnTasks.
    at org.apache.giraph.yarn.GiraphApplicationMaster.getTaskResourceMap(GiraphApplicationMaster.java:391)
    at org.apache.giraph.yarn.GiraphApplicationMaster.access$500(GiraphApplicationMaster.java:78)
    at org.apache.giraph.yarn.GiraphApplicationMaster$LaunchContainerRunnable.buildContainerLaunchContext(GiraphApplicationMaster.java:522)
    at org.apache.giraph.yarn.GiraphApplicationMaster$LaunchContainerRunnable.run(GiraphApplicationMaster.java:479)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://iga-adi-m/user/yarn/giraph_yarn_jar_cache/application_1568451681492_0016/solver-1.0-SNAPSHOT.jar
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1533)
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1526)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1541)
    at org.apache.giraph.yarn.YarnUtils.addFileToResourceMap(YarnUtils.java:153)
    at org.apache.giraph.yarn.YarnUtils.addFsResourcesToMap(YarnUtils.java:77)
    at org.apache.giraph.yarn.GiraphApplicationMaster.getTaskResourceMap(GiraphApplicationMaster.java:387)
    ... 6 more
End of LogType:gam-stderr.log.This log file belongs to a running container (container_1568451681492_0016_01_000001) and so may not be complete.

This causes class-not-found errors (GiraphYarnTask) in the worker containers.

It seems that for some reason the JAR doesn't get transferred to HDFS along with the config (which does get transferred). What might be the reason for that?

Also, it seems that the JAR is getting sent:

1492_0021/solver-1.0-SNAPSHOT.jar, packetSize=65016, chunksPerPacket=126, bytesCurBlock=73672704
2019-09-14 14:08:26,252 DEBUG [DFSOutputStream] - enqueue full packet seqno: 1142 offsetInBlock: 73672704 lastPacketInBlock: false lastByteOffsetInBlock: 73737216, src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, bytesCurBlock=73737216, blockSize=134217728, appendChunk=false, blk_1073741905_1081@[DatanodeInfoWithStorage[10.164.0.6:9866,DS-2d8f815f-1e64-4a7f-bbf6-0c91ebc613d7,DISK], DatanodeInfoWithStorage[10.164.0.7:9866,DS-6a606f45-ffb7-449f-ab8b-57d5950d5172,DISK]]
2019-09-14 14:08:26,252 DEBUG [DataStreamer] - Queued packet 1142
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1142 offsetInBlock: 73672704 lastPacketInBlock: false lastByteOffsetInBlock: 73737216
2019-09-14 14:08:26,253 DEBUG [DFSClient] - computePacketChunkSize: src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, chunkSize=516, chunksPerPacket=126, packetSize=65016
2019-09-14 14:08:26,253 DEBUG [DFSClient] - DFSClient writeChunk allocating new packet seqno=1143, src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, packetSize=65016, chunksPerPacket=126, bytesCurBlock=73737216
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DFSClient seqno: 1141 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 323347 flag: 0 flag: 0
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DFSClient seqno: 1142 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 326916 flag: 0 flag: 0
2019-09-14 14:08:26,254 DEBUG [DataStreamer] - Queued packet 1143
2019-09-14 14:08:26,256 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1143 offsetInBlock: 73737216 lastPacketInBlock: false lastByteOffsetInBlock: 73771432
2019-09-14 14:08:26,256 DEBUG [DataStreamer] - Queued packet 1144
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - Waiting for ack for: 1144
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - DFSClient seqno: 1143 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 497613 flag: 0 flag: 0
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1144 offsetInBlock: 73771432 lastPacketInBlock: true lastByteOffsetInBlock: 73771432
2019-09-14 14:08:26,263 DEBUG [DataStreamer] - DFSClient seqno: 1144 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 2406978 flag: 0 flag: 0
2019-09-14 14:08:26,263 DEBUG [DataStreamer] - Closing old block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081
2019-09-14 14:08:26,264 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit sending #12 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete
2019-09-14 14:08:26,266 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit got value #12
2019-09-14 14:08:26,267 DEBUG [ProtobufRpcEngine] - Call: complete took 4ms
2019-09-14 14:08:26,267 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit sending #13 org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo
2019-09-14 14:08:26,268 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit got value #13
2019-09-14 14:08:26,268 DEBUG [ProtobufRpcEngine] - Call: getFileInfo took 1ms
2019-09-14 14:08:26,269 INFO  [YarnUtils] - Registered file in LocalResources :: hdfs://iga-adi-m/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar

but once I inspect the contents, the JAR is not there:

2019-09-14 14:16:42,795 DEBUG [ProtobufRpcEngine] - Call: getListing took 6ms
Found 1 items
-rw-r--r--   2 yarn hadoop     187800 2019-09-14 14:08 hdfs://iga-adi-m/user/yarn/giraph_yarn_jar_cache/application_1568451681492_0021/giraph-conf.xml

Meanwhile, if I manually copy the jar to that directory (predicting its name), everything works as expected. What is wrong?

I think it might be connected to GIRAPH-859.

It seems that even though the Giraph maintainers claim it can run in YARN mode, that is not really true. There are a number of bugs which make it difficult unless you know the root cause, as in this case.

The cause here is that when Giraph ships the jars to HDFS, from where they should be accessible to the workers, it uses one location for the upload and another for the download, so the workers cannot find the file. This happens whenever the application is launched as a user other than yarn - probably a fairly common scenario.
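The logs above show exactly this mismatch: the client uploads to /user/kbhit/giraph_yarn_jar_cache/... while the application master looks in /user/yarn/giraph_yarn_jar_cache/.... A toy sketch of the problem, with the path template inferred from those logs:

```java
public class JarCachePaths {
    // Jar-cache directory as it appears in the logs above; it depends on
    // a user name, which is the root of the problem.
    static String cacheDir(String user, String appId) {
        return "/user/" + user + "/giraph_yarn_jar_cache/" + appId;
    }

    public static void main(String[] args) {
        String appId = "application_1568451681492_0016";
        String uploadDir = cacheDir("kbhit", appId); // where the client writes
        String lookupDir = cacheDir("yarn", appId);  // where workers look
        // The two differ, hence the FileNotFoundException in the logs.
        System.out.println(uploadDir.equals(lookupDir));
        // prints: false
    }
}
```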

There are three workarounds, none of which is ideal (and some might not be applicable):

  • run the application as the yarn user
  • upload the jars manually before each computation (make sure you upload to the new application directory - just increment the job number - and remember that you have to create that directory first)
  • apply this patch and build against that version of Giraph
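For the manual-upload workaround, the commands could look like this sketch (the application id is hypothetical - you have to predict the next one by incrementing the job number, and the directory must be created before the upload):

```shell
# Predicted id of the NEXT application (hypothetical - increment the job number).
APP_ID=application_1568451681492_0022
TARGET=/user/yarn/giraph_yarn_jar_cache/$APP_ID

# The directory must exist before the jar is uploaded.
hdfs dfs -mkdir -p "$TARGET"
hdfs dfs -put solver-1.0-SNAPSHOT.jar "$TARGET/"
```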

I tested all three; they all work.

I got a similar error:

20/03/04 09:40:10 ERROR yarn.GiraphYarnTask: GiraphYarnTask threw a top-level exception, failing task
java.lang.RuntimeException: run() caught an unrecoverable IOException.
    at org.apache.giraph.yarn.GiraphYarnTask.run(GiraphYarnTask.java:97)
    at org.apache.giraph.yarn.GiraphYarnTask.main(GiraphYarnTask.java:183)
Caused by: java.io.FileNotFoundException: File hdfs://localhost:9000/user/schramml/_bsp/_defaultZkManagerDir/giraph_yarn_application_1583310839052_0001 does not exist.
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:993)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:118)
    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1053)
    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1050)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1050)
    at org.apache.giraph.zk.ZooKeeperManager.getServerListFile(ZooKeeperManager.java:346)
    at org.apache.giraph.zk.ZooKeeperManager.getZooKeeperServerList(ZooKeeperManager.java:376)
    at org.apache.giraph.zk.ZooKeeperManager.setup(ZooKeeperManager.java:190)
    at org.apache.giraph.graph.GraphTaskManager.startZooKeeperManager(GraphTaskManager.java:449)
    at org.apache.giraph.graph.GraphTaskManager.setup(GraphTaskManager.java:251)
    at org.apache.giraph.yarn.GiraphYarnTask.run(GiraphYarnTask.java:91)
    ... 1 more

But in my case the reason was that I used an aggregatorWriter and had to delete the writer's output file from the previous run. There was also a "file already exists" error in another container, but this question was the first thing I found, so maybe this information helps someone else.
