
Unable to read file in Java Spark

I am trying to run a Spark program in Java using Eclipse. It runs if I simply print something to the console, but I am not able to read any file using the textFile function. I have read somewhere that reading a file can only be done through HDFS, but I cannot get that working on my local system. Please let me know how to access/read a file, and if HDFS is required, how to install HDFS on my local system so that I can read the text file.

Here is the code I am testing with. The program otherwise runs fine, but it cannot read the file and fails with "Input path does not exist".

package spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

import org.apache.spark.api.java.function.Function;

public class TestSpark {

    public static void main(String args[])
    {
        String[] jars = {"D:\\customJars\\spark.jar"};
        System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master");
        SparkConf sparkConf = new SparkConf().setAppName("spark.TestSpark")
                .setMaster("spark://10.1.50.165:7077")
                .setJars(jars);

        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        SQLContext sqlcon = new SQLContext(jsc);
        String inputFileName = "./forecaster.txt" ;
        JavaRDD<String> logData = jsc.textFile(inputFileName);
        long numAs = logData.filter(new Function<String, Boolean>() {

            @Override
            public Boolean call(String s) throws Exception {
                return s.contains("a");
            }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {
              public Boolean call(String s) { return s.contains("b"); }
            }).count();

         System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        System.out.println("sadasdasdf");

        jsc.stop();
        jsc.close();
    }

}

My file structure: (screenshot of project structure)

Update: the file name does not have a .txt extension, but you are using one in your application. You should use it as String inputFileName = "forecaster" ;

If the file is in the same folder as the Java class TestSpark ( $APP_HOME ):

String inputFileName = "forecaster.txt" ;

If the file is in a Data directory under your Spark project:

String inputFileName = "Data\\forecaster.txt" ;

Or use the fully qualified path; the log from the test run below shows:

16/08/03 08:25:25 INFO HadoopRDD: Input split: file:/C:/Users/user123/worksapce/spark-java/forecaster.txt
String inputFileName = "file:/C:/Users/user123/worksapce/spark-java/forecaster.txt" ;

For example, I copied your code and ran it in my local environment:

This is how my project is set up, and I run it with:

 String inputFileName = "forecaster.txt" ;

Test file:

this is test file
aaa
bbb
ddddaaee
ewwww
aaaa
a
a
aaaa
bb


Code that I used:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class TestSpark {

    public static void main(String args[])
    {
       // String[] jars = {"D:\\customJars\\spark.jar"};
       // System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master");
        SparkConf sparkConf = new SparkConf().setAppName("spark.TestSpark").setMaster("local");
                //.setMaster("spark://10.1.50.165:7077")
                //.setJars(jars);

        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        //SQLContext sqlcon = new SQLContext(jsc);
        String inputFileName = "forecaster.txt" ;
        JavaRDD<String> logData = jsc.textFile(inputFileName);
        long numAs = logData.filter(new Function<String, Boolean>() {

            @Override
            public Boolean call(String s) throws Exception {
                return s.contains("a");
            }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {

            public Boolean call(String s) { return s.contains("b"); }
            }).count();

         System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        System.out.println("sadasdasdf");

        jsc.stop();
        jsc.close();
    }

}
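With the test file above, the run should print something like Lines with a: 6, lines with b: 2, since six of the lines contain an "a" and two contain a "b".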

Spark needs a URI scheme and a proper path in order to understand how to read the file. So if you are reading from HDFS, you should use:

jsc.textFile("hdfs://host/path/to/hdfs/file/input.txt");

If you are reading a local file (local to the worker nodes, not the machine the driver is running on), you should use:

jsc.textFile("file://path/to/hdfs/file/input.txt");

For reading a Hadoop Archive File (HAR), you should use:

jsc.textFile("har://archive/path/to/hdfs/file/input.txt");

And so on.
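A minimal sketch of how this can be used (the class name, argument handling and local[*] master are just assumptions for illustration, not part of the original code) is to take the fully qualified URI as a program argument, so the same job can read from hdfs://, file:// or har:// paths:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CountLines {

    public static void main(String[] args) {
        // Expect the fully qualified input URI as the first argument,
        // e.g. hdfs://host/path/input.txt or file:///C:/data/input.txt
        String inputUri = args[0];

        SparkConf conf = new SparkConf().setAppName("CountLines").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // textFile picks the right file system implementation from the URI scheme
        JavaRDD<String> lines = jsc.textFile(inputUri);
        System.out.println("Number of lines: " + lines.count());

        jsc.stop();
        jsc.close();
    }
}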
