
Submitting Spark application on standalone cluster

I am rather new to using Spark and I am having issues running a simple word count application on a standalone cluster. I have a cluster consisting of one master node and one worker, launched on AWS using the spark-ec2 script. Everything works fine when I run the code locally using

./bin/spark-submit --class com.spark.SparkDataAnalysis --master local[*] ./uber-ingestion-0.0.1-SNAPSHOT.jar file:///root/textfile.txt s3n://bucket/wordcount

This saves the output into the specified directory as it should.

When I try to run the application on the cluster using

./bin/spark-submit --class com.spark.SparkDataAnalysis --master spark://server-ip:7077 ./uber-ingestion-0.0.1-SNAPSHOT.jar file:///root/textfile.txt s3n://bucket/wordcount

it just keeps on running and never produces a final result. The directory gets created, but only a temporary file of 0 bytes is present.

According to the Spark UI, it keeps on running the mapToPair function indefinitely. Here is a picture of the Spark UI:

[Spark UI screenshot]

Does anyone know why this is happening and how to solve it?

Here is the code:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkDataAnalysis {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkDataAnalysis");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // args[0] is the input path, e.g. file:///root/textfile.txt
        JavaRDD<String> input = sc.textFile(args[0]);

        // Split each line into words (Spark 1.x flatMap returns an Iterable)
        JavaRDD<String> words = input.flatMap(s -> Arrays.asList(s.split(" ")));

        // Pair each word with a count of 1, then sum the counts per word
        JavaPairRDD<String, Integer> counts =
                words.mapToPair(t -> new Tuple2<>(t, 1)).reduceByKey((x, y) -> x + y);

        // args[1] is the output directory, e.g. s3n://bucket/wordcount
        counts.saveAsTextFile(args[1]);
        sc.stop();
    }
}
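One thing worth double-checking here (an assumption on my part, not something confirmed above): with --master spark://server-ip:7077 the tasks run on the worker machine, so both the file:///root/textfile.txt input and the S3 credentials for the s3n:// output have to be visible on the worker, not just on the master where spark-submit is invoked. A minimal sketch of passing credentials through the Hadoop configuration, assuming the standard AWS environment variables are set where the driver runs and using the property names of the legacy s3n connector, would go right after the JavaSparkContext is created:

        // Hypothetical sketch: ship S3 credentials with the job so executors can
        // write to s3n:// paths. Assumes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
        // are set in the environment where the driver runs.
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", System.getenv("AWS_ACCESS_KEY_ID"));
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", System.getenv("AWS_SECRET_ACCESS_KEY"));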

I skipped using a standalone cluster via the spark-ec2 script and used Amazon EMR instead, where everything worked perfectly.
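For reference, EMR runs Spark on YARN rather than a standalone master, so the submission there looks roughly like this (a sketch only; the jar and bucket names are carried over from above, and on EMR the input would normally live in S3 rather than on a local path):

spark-submit --class com.spark.SparkDataAnalysis --master yarn --deploy-mode cluster ./uber-ingestion-0.0.1-SNAPSHOT.jar s3://bucket/textfile.txt s3://bucket/wordcount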
