
Spark: best practice for two SparkContexts in a single application

I think I have an interesting question for all of you today. In the code below you will notice I have two SparkContexts: one for Spark Streaming and one normal SparkContext. According to best practices you should have only one SparkContext in a Spark application, even though it's possible to circumvent this via allowMultipleContexts in the configuration.

The problem is, I need to retrieve data from Hive and from a Kafka topic to do some logic, and whenever I submit my application it fails with "Cannot have 2 Spark Contexts Running on JVM".

My question is: is there a more correct way to do this than how I am doing it right now?

public class MainApp {

private final String logFile= Properties.getString("SparkLogFileDir");
private static final String KAFKA_GROUPID = Properties.getString("KafkaGroupId");
private static final String ZOOKEEPER_URL = Properties.getString("ZookeeperURL");
private static final String KAFKA_BROKER = Properties.getString("KafkaBroker");
private static final String KAFKA_TOPIC = Properties.getString("KafkaTopic");
private static final String Database = Properties.getString("HiveDatabase");
private static final Integer KAFKA_PARA = Properties.getInt("KafkaParrallel");

public static void main(String[] args){
    //set settings
    String sql="";

    //START APP
    System.out.println("Starting NPI_TWITTERAPP...." + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
    System.out.println("Configuring Settings...."+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
    SparkConf conf = new SparkConf()
            .setAppName(Properties.getString("SparkAppName"))
            .setMaster(Properties.getString("SparkMasterUrl"));

    //Set Spark/hive/sql Context
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));
    JavaHiveContext HiveSqlContext = new JavaHiveContext(sc);

    //Check if Twitter Hive Table Exists
    try {
        HiveSqlContext.sql("DROP TABLE IF EXISTS "+Database+"TWITTERSTORE");
        HiveSqlContext.sql("CREATE TABLE IF NOT EXISTS "+Database+".TWITTERSTORE "
        +" (created_at String, id String, id_str String, text String, source String, truncated String, in_reply_to_user_id String, processed_at String, lon String, lat String)"
        +" STORED AS TEXTFILE");
    }catch(Exception e){
        System.out.println(e);
    }
    //Check if Ivapp Table Exists

    sql ="CREATE TABLE IF NOT EXISTS "+Database+".IVAPPGEO AS SELECT DISTINCT a.LATITUDE, a.LONGITUDE, b.ODNCIRCUIT_OLT_CLLI, b.ODNCIRCUIT_OLT_TID, a.CITY, a.STATE, a.ZIP FROM "
            +Database+".T_PONNMS_SERVICE B, "
            +Database+".CLLI_LATLON_MSTR A WHERE a.BID_CLLI = substr(b.ODNCIRCUIT_OLT_CLLI,0,8)";
    try {
        System.out.println(sql + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
        HiveSqlContext.sql(sql);

        sql = "SELECT LATITUDE, LONGITUDE, ODNCIRCUIT_OLT_CLLI, ODNCIRCUIT_OLT_TID, CITY, STATE, ZIP FROM "+Database+".IVAPPGEO";

        JavaSchemaRDD RDD_IVAPPGEO = HiveSqlContext.sql(sql).cache();

    }catch(Exception e){
        System.out.println(sql + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
    }

    //JavaHiveContext hc = new JavaHiveContext();
    System.out.println("Retrieve Data from Kafka Topic: "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
    Map<String, Integer> topicMap = new HashMap<String, Integer>();
    topicMap.put(KAFKA_TOPIC,KAFKA_PARA);

    JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(
                jssc, ZOOKEEPER_URL, KAFKA_GROUPID, topicMap);

    JavaDStream<String> json = messages.map(
            new Function<Tuple2<String, String>, String>() {
                private static final long serialVersionUID = 42L;
                @Override
                public String call(Tuple2<String, String> message) {
                    return message._2();
                }
            }
    );
    System.out.println("Completed Kafka Messages... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));


    System.out.println("Filtering Resultset... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));

    JavaPairDStream<Long, String> tweets = json.mapToPair(
            new TwitterFilterFunction());

    JavaPairDStream<Long, String> filtered = tweets.filter(
            new Function<Tuple2<Long, String>, Boolean>() {
                private static final long serialVersionUID = 42L;
                @Override
                public Boolean call(Tuple2<Long, String> tweet) {
                    return tweet != null;
                }
            }
    );

    JavaDStream<Tuple2<Long, String>> tweetsFiltered = filtered.map(
            new TextFilterFunction());

    tweetsFiltered = tweetsFiltered.map(
            new StemmingFunction());

    System.out.println("Finished Filtering Resultset... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));



    System.out.println("Processing Sentiment Data... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));

    //calculate positive tweets
    JavaPairDStream<Tuple2<Long, String>, Float> positiveTweets =
            tweetsFiltered.mapToPair(new PositiveScoreFunction());
    //calculate negative tweets
    JavaPairDStream<Tuple2<Long, String>, Float> negativeTweets =
            tweetsFiltered.mapToPair(new NegativeScoreFunction());

    JavaPairDStream<Tuple2<Long, String>, Tuple2<Float, Float>> joined =
            positiveTweets.join(negativeTweets);

    //Score tweets
    JavaDStream<Tuple4<Long, String, Float, Float>> scoredTweets =
            joined.map(new Function<Tuple2<Tuple2<Long, String>,
                    Tuple2<Float, Float>>,
                    Tuple4<Long, String, Float, Float>>() {
                private static final long serialVersionUID = 42L;
                @Override
                public Tuple4<Long, String, Float, Float> call(
                        Tuple2<Tuple2<Long, String>, Tuple2<Float, Float>> tweet)
                {
                    return new Tuple4<Long, String, Float, Float>(
                            tweet._1()._1(),
                            tweet._1()._2(),
                            tweet._2()._1(),
                            tweet._2()._2());
                }
            });

    System.out.println("Finished Processing Sentiment Data... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));

    System.out.println("Outputting Tweets Data to flat file "+Properties.getString("HdfsOutput")+" ... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));

    JavaDStream<Tuple5<Long, String, Float, Float, String>> result =
            scoredTweets.map(new ScoreTweetsFunction());

    result.foreachRDD(new FileWriter());

    System.out.println("Outputting Sentiment Data to Hive... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));


    jssc.start();
    jssc.awaitTermination();
}

}

Creating SparkContext

You can create a SparkContext instance with or without creating a SparkConf object first.

Getting Existing or Creating New SparkContext (getOrCreate methods)

getOrCreate(): SparkContext
getOrCreate(conf: SparkConf): SparkContext

SparkContext.getOrCreate methods allow you to get the existing SparkContext or create a new one.

import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()

// Using an explicit SparkConf object
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("SparkMe App")
val sc = SparkContext.getOrCreate(conf)

Refer here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sparkcontext.html
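
If you are working in Java like the question's code, a rough equivalent is sketched below. This assumes Spark 1.4+, where SparkContext.getOrCreate was added, and uses the JavaSparkContext.fromSparkContext helper to wrap the result:

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("SparkMe App");
// Reuses the already-running SparkContext if there is one, otherwise creates it
JavaSparkContext sc = JavaSparkContext.fromSparkContext(SparkContext.getOrCreate(conf));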

Apparently, if I close the original SparkContext with sc.close() before creating the JavaStreamingContext, it runs perfectly with no errors or issues.
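
For reference, a minimal sketch of that ordering, reusing the classes from the question; note that anything cached in the first context is lost when it is closed, so collect whatever the streaming phase needs before closing it:

JavaSparkContext sc = new JavaSparkContext(conf);
JavaHiveContext hiveSqlContext = new JavaHiveContext(sc);
// ... run the Hive DDL/queries and collect anything the streaming phase needs ...
sc.close();  // release the first SparkContext before creating the streaming one

JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));
// ... build the Kafka DStream pipeline as in the question ...
jssc.start();
jssc.awaitTermination();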

You can use a singleton object, ContextManager, which handles which context to provide.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ContextManager {

    private static JavaSparkContext context;
    private static JavaStreamingContext streamingContext;
    private static String currentType;

    private ContextManager() {}

    public static synchronized JavaSparkContext getContext(String type, SparkConf conf) {

        if (type.equals(currentType) && context != null) {
            return context;
        }

        // Clean up whichever context is currently running before creating a new one
        if (streamingContext != null) {
            streamingContext.stop();   // also stops the underlying SparkContext
            streamingContext = null;
        } else if (context != null) {
            context.stop();
        }
        context = null;

        if ("streaming".equals(type)) {
            // Initialize a streaming context and expose its underlying JavaSparkContext
            streamingContext = new JavaStreamingContext(conf, new Duration(5000));
            context = streamingContext.sparkContext();
        } else {
            // Initialize a normal context
            context = new JavaSparkContext(conf);
        }
        currentType = type;

        return context;
    }

    // Only valid after getContext("streaming", conf) has been called
    public static synchronized JavaStreamingContext getStreamingContext() {
        return streamingContext;
    }
}
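
A hypothetical usage of the sketch above (the "batch"/"streaming" type strings and the getStreamingContext accessor are assumptions for illustration, not part of the original answer):

SparkConf conf = new SparkConf().setAppName("foo").setMaster("local[*]");

// Batch phase: Hive queries run on a plain context.
JavaSparkContext sc = ContextManager.getContext("batch", conf);

// Streaming phase: the batch context is stopped and replaced.
ContextManager.getContext("streaming", conf);
JavaStreamingContext jssc = ContextManager.getStreamingContext();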

One caveat: in projects where you switch contexts frequently, the overhead of stopping and recreating them can be quite large.

You can access the SparkContext from your JavaStreamingContext, and use that reference when creating additional contexts.

SparkConf sparkConfig = new SparkConf().setAppName("foo");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConfig, Durations.seconds(30));
SQLContext sqlContext = new SQLContext(jssc.sparkContext());
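
Applied to the question's code, the same pattern would let the Hive work and the streaming job share one underlying context. A minimal sketch, assuming the same Spark 1.x JavaHiveContext the question already uses:

// The streaming context owns the one and only SparkContext...
JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));
// ...and the Hive context is built on that same context,
// so no second SparkContext is ever created.
JavaHiveContext hiveSqlContext = new JavaHiveContext(jssc.sparkContext());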
