
Spark SQL doesn't see hdfs files

I have a Spark application running on an AWS EMR cluster.

I've added a file to HDFS:

javaSparkContext.addFile(filePath, recursive);

The file exists on HDFS (logs confirm it is readable/executable/writable), but I can't read data from it using the Spark SQL API:

 LOGGER.info("Spark working directory: " + path);
 File file = new File(path + "/test.avro");
 LOGGER.info("SPARK PATH:" + file);
 LOGGER.info("read:" + file.canRead());
 LOGGER.info("execute:" + file.canExecute());
 LOGGER.info("write:" + file.canWrite());
 Dataset<Row> load = getSparkSession()
                      .read()
                      .format(AVRO_DATA_BRICKS_LIBRARY)
                      .load(file.getAbsolutePath()); 

Here are the logs:

17/08/07 15:03:25 INFO SparkContext: Added file /mnt/yarn/usercache/hadoop/appcache/application_1502118042722_0001/container_1502118042722_0001_01_000001/test.avro at spark://HOST:PORT/files/test.avro with timestamp 1502118205059
17/08/07 15:03:25 INFO Utils: Copying /mnt/yarn/usercache/hadoop/appcache/application_1502118042722_0001/container_1502118042722_0001_01_000001/test.avro to /mnt/yarn/usercache/hadoop/appcache/application_1502118042722_0001/spark-d5b494fc-2613-426f-80fc-ca66279c2194/userFiles-44aad2e8-04f4-420b-9b5e-a1ccde5db9ec/test.avro
17/08/07 15:03:25 INFO AbstractS3Calculator: Spark working directory: /mnt/yarn/usercache/hadoop/appcache/application_1502118042722_0001/spark-d5b494fc-2613-426f-80fc-ca66279c2194/userFiles-44aad2e8-04f4-420b-9b5e-a1ccde5db9ec
17/08/07 15:03:25 INFO AbstractS3Calculator: SPARK PATH:/mnt/yarn/usercache/hadoop/appcache/application_1502118042722_0001/spark-d5b494fc-2613-426f-80fc-ca66279c2194/userFiles-44aad2e8-04f4-420b-9b5e-a1ccde5db9ec/test.avro
17/08/07 15:03:25 INFO AbstractS3Calculator: read:true
17/08/07 15:03:25 INFO AbstractS3Calculator: execute:true
17/08/07 15:03:25 INFO AbstractS3Calculator: write:true

org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://HOST:PORT/mnt/yarn/usercache/hadoop/appcache/application_1502118042722_0001/spark-d5b494fc-2613-426f-80fc-ca66279c2194/userFiles-44aad2e8-04f4-420b-9b5e-a1ccde5db9ec/test.avro;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:344)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
    at odh.spark.services.algorithms.calculators.RiskEngineS3Calculator.getInputMembers(RiskEngineS3Calculator.java:76)
    at odh.spark.services.algorithms.calculators.RiskEngineS3Calculator.getMembersDataSets(RiskEngineS3Calculator.java:124)
    at odh.spark.services.algorithms.calculators.AbstractS3Calculator.calculate(AbstractS3Calculator.java:50)
    at odh.spark.services.ProgressSupport.start(ProgressSupport.java:47)
    at odh.spark.services.Engine.startCalculations(Engine.java:102)
    at odh.spark.services.Engine.startCalculations(Engine.java:135)
    at odh.spark.SparkApplication.main(SparkApplication.java:19)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
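A likely reading of the logs above: `SparkContext.addFile` copies the file to a *local* directory on the driver (`/mnt/yarn/usercache/...`), but `DataFrameReader.load()` resolves a scheme-less path against `fs.defaultFS`, which on EMR points at HDFS. Spark therefore prepends `hdfs://HOST:PORT` to the local path, producing the "Path does not exist" error. The scheme resolution can be sketched with plain `java.net.URI` (the host, port, and path below are illustrative placeholders, not values from the logs):

```java
import java.net.URI;

public class SchemeResolution {
    public static void main(String[] args) {
        // fs.defaultFS on the cluster (placeholder host/port)
        URI defaultFs = URI.create("hdfs://HOST:8020/");

        // Local driver path produced by SparkContext.addFile (stand-in)
        String localPath = "/mnt/data/test.avro";

        // A path without a scheme is resolved against the default filesystem,
        // which is why the exception shows hdfs://HOST:PORT/<local path>
        URI onHdfs = defaultFs.resolve(localPath);
        System.out.println(onHdfs); // hdfs://HOST:8020/mnt/data/test.avro

        // An explicit file:// scheme keeps the path on the local filesystem
        URI onLocal = URI.create("file://" + localPath);
        System.out.println(onLocal.getScheme()); // file
    }
}
```

So reading the `addFile` copy may work with `.load("file://" + file.getAbsolutePath())`, provided the file exists on every executor; otherwise, upload it to HDFS as the answer below suggests.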

Check whether the file actually exists in your HDFS:

hadoop fs -ls /home/spark/ # or your working directory instead of /home/spark

If the file is on HDFS, the problem looks like it is on Spark's side; follow the instructions in the description, or update your Spark version to the latest.

By default, files are stored in the /user/hadoop/ folder in HDFS. (You could rely on this and load with that constant path, but it's better to use absolute paths.)
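That default matters because HDFS qualifies a bare relative path against the submitting user's home directory. A minimal sketch of that resolution (the host, port, and user are assumptions for illustration):

```java
import java.net.URI;

public class HomeDirResolution {
    public static void main(String[] args) {
        // Placeholder values — the real ones come from fs.defaultFS
        // and the submitting user's HDFS home directory
        URI defaultFs = URI.create("hdfs://HOST:8020");
        String homeDir = "/user/hadoop/";
        String relative = "test.avro";

        // A relative path like "test.avro" is qualified against the home dir
        URI qualified = defaultFs.resolve(homeDir + relative);
        System.out.println(qualified); // hdfs://HOST:8020/user/hadoop/test.avro
    }
}
```

An absolute path skips this qualification step entirely, which is why it is the safer choice below.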

To upload files to HDFS and use them, I used absolute paths:

// get the HDFS root from the cluster configuration
new Configuration().get("fs.defaultFS");
....
FileSystem hdfs = getHdfsFileSystem();
hdfs.copyFromLocalFile(true, true, new Path(srcLocalPath), new Path(destHdfsPath));

where destHdfsPath is an absolute path (like 'hdfs://...../test.avro').

Then you can load the data from HDFS:

return getSparkSession()
                .read()
                .format(AVRO_DATA_BRICKS_LIBRARY)
                .load(absoluteFilePath);

NOTE: you may need to add some permissions: FileUtil.chmod(hdfsDest, "u+rw,g+rw,o+rw");
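For reference, the symbolic mode `u+rw,g+rw,o+rw` adds read and write for user, group, and others. A plain-JDK sketch of the resulting permission set (this uses `java.nio`, not the Hadoop `FileUtil` call itself):

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class ChmodEquivalent {
    public static void main(String[] args) {
        // "u+rw,g+rw,o+rw" corresponds to the rw-rw-rw- permission string
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString("rw-rw-rw-");
        System.out.println(perms.contains(PosixFilePermission.OTHERS_WRITE)); // true
        System.out.println(perms.size()); // 6
    }
}
```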
