Spark: Two SparkContexts in a single Application Best Practice

I think I have an interesting question for all of you today. In the code below you will notice I have two SparkContexts: one for Spark Streaming and the other a normal SparkContext. According to best practices you should have only one SparkContext in a Spark application, even though it's possible to circumvent this via allowMultipleContexts in the configuration.

Problem is, I need to retrieve data from Hive and from a Kafka topic to do some logic, and whenever I submit my application it obviously returns "Cannot have 2 Spark Contexts Running on JVM".

My question is, is there a more correct way to do this than how I am doing it right now?

public class MainApp {

private final String logFile= Properties.getString("SparkLogFileDir");
private static final String KAFKA_GROUPID = Properties.getString("KafkaGroupId");
private static final String ZOOKEEPER_URL = Properties.getString("ZookeeperURL");
private static final String KAFKA_BROKER = Properties.getString("KafkaBroker");
private static final String KAFKA_TOPIC = Properties.getString("KafkaTopic");
private static final String Database = Properties.getString("HiveDatabase");
private static final Integer KAFKA_PARA = Properties.getInt("KafkaParrallel");

public static void main(String[] args){
    //set settings
    String sql="";

    //START APP
    System.out.println("Starting NPI_TWITTERAPP...." + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
    System.out.println("Configuring Settings...."+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
    SparkConf conf = new SparkConf()
            .setAppName(Properties.getString("SparkAppName"))
            .setMaster(Properties.getString("SparkMasterUrl"));

    //Set Spark/hive/sql Context
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));
    JavaHiveContext HiveSqlContext = new JavaHiveContext(sc);

    //Check if Twitter Hive Table Exists
    try {
        HiveSqlContext.sql("DROP TABLE IF EXISTS "+Database+"TWITTERSTORE");
        HiveSqlContext.sql("CREATE TABLE IF NOT EXISTS "+Database+".TWITTERSTORE "
        +" (created_at String, id String, id_str String, text String, source String, truncated String, in_reply_to_user_id String, processed_at String, lon String, lat String)"
        +" STORED AS TEXTFILE");
    }catch(Exception e){
        System.out.println(e);
    }
    //Check if Ivapp Table Exists

    sql ="CREATE TABLE IF NOT EXISTS "+Database+".IVAPPGEO AS SELECT DISTINCT a.LATITUDE, a.LONGITUDE, b.ODNCIRCUIT_OLT_CLLI, b.ODNCIRCUIT_OLT_TID, a.CITY, a.STATE, a.ZIP FROM "
            +Database+".T_PONNMS_SERVICE B, "
            +Database+".CLLI_LATLON_MSTR A WHERE a.BID_CLLI = substr(b.ODNCIRCUIT_OLT_CLLI,0,8)";
    try {
        System.out.println(sql + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
        HiveSqlContext.sql(sql);

        sql = "SELECT LATITUDE, LONGITUDE, ODNCIRCUIT_OLT_CLLI, ODNCIRCUIT_OLT_TID, CITY, STATE, ZIP FROM "+Database+".IVAPPGEO";

        JavaSchemaRDD RDD_IVAPPGEO = HiveSqlContext.sql(sql).cache();

    }catch(Exception e){
        System.out.println(sql + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
    }

    //JavaHiveContext hc = new JavaHiveContext();
    System.out.println("Retrieve Data from Kafka Topic: "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
    Map<String, Integer> topicMap = new HashMap<String, Integer>();
    topicMap.put(KAFKA_TOPIC,KAFKA_PARA);

    JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(
                jssc, KAFKA_GROUPID, ZOOKEEPER_URL, topicMap);

    JavaDStream<String> json = messages.map(
            new Function<Tuple2<String, String>, String>() {
                private static final long serialVersionUID = 42l;
                @Override
                public String call(Tuple2<String, String> message) {
                    return message._2();
                }
            }
    );
    System.out.println("Completed Kafka Messages... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));


    System.out.println("Filtering Resultset... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));

    JavaPairDStream<Long, String> tweets = json.mapToPair(
            new TwitterFilterFunction());

    JavaPairDStream<Long, String> filtered = tweets.filter(
            new Function<Tuple2<Long, String>, Boolean>() {
                private static final long serialVersionUID = 42l;
                @Override
                public Boolean call(Tuple2<Long, String> tweet) {
                    return tweet != null;
                }
            }
    );

    JavaDStream<Tuple2<Long, String>> tweetsFiltered = filtered.map(
            new TextFilterFunction());

    tweetsFiltered = tweetsFiltered.map(
            new StemmingFunction());

    System.out.println("Finished Filtering Resultset... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));



    System.out.println("Processing Sentiment Data... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));

    //calculate positive tweets
    JavaPairDStream<Tuple2<Long, String>, Float> positiveTweets =
            tweetsFiltered.mapToPair(new PositiveScoreFunction());
    //calculate negative tweets
    JavaPairDStream<Tuple2<Long, String>, Float> negativeTweets =
            tweetsFiltered.mapToPair(new NegativeScoreFunction());

    JavaPairDStream<Tuple2<Long, String>, Tuple2<Float, Float>> joined =
            positiveTweets.join(negativeTweets);

    //Score tweets
    JavaDStream<Tuple4<Long, String, Float, Float>> scoredTweets =
            joined.map(new Function<Tuple2<Tuple2<Long, String>,
                    Tuple2<Float, Float>>,
                    Tuple4<Long, String, Float, Float>>() {
                private static final long serialVersionUID = 42l;
                @Override
                public Tuple4<Long, String, Float, Float> call(
                        Tuple2<Tuple2<Long, String>, Tuple2<Float, Float>> tweet)
                {
                    return new Tuple4<Long, String, Float, Float>(
                            tweet._1()._1(),
                            tweet._1()._2(),
                            tweet._2()._1(),
                            tweet._2()._2());
                }
            });

    System.out.println("Finished Processing Sentiment Data... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));

    System.out.println("Outputting Tweets Data to flat file "+Properties.getString("HdfsOutput")+" ... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));

    JavaDStream<Tuple5<Long, String, Float, Float, String>> result =
            scoredTweets.map(new ScoreTweetsFunction());

    result.foreachRDD(new FileWriter());

    System.out.println("Outputting Sentiment Data to Hive... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));


    jssc.start();
    jssc.awaitTermination();
}

}

Creating SparkContext

You can create a SparkContext instance with or without creating a SparkConf object first.

Getting Existing or Creating New SparkContext (getOrCreate methods)

getOrCreate(): SparkContext
getOrCreate(conf: SparkConf): SparkContext

SparkContext.getOrCreate methods allow you to get the existing SparkContext or create a new one.

import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()

// Using an explicit SparkConf object
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("SparkMe App")
val sc = SparkContext.getOrCreate(conf)
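
For the Java API used in the question, a comparable sketch (assuming a Spark version where SparkContext.getOrCreate is available, i.e. 1.4+):

SparkConf conf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("SparkMe App");
// getOrCreate returns the active SparkContext or creates a new one,
// and fromSparkContext wraps it for the Java API
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(SparkContext.getOrCreate(conf));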

Refer Here - https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sparkcontext.html

Apparently, if I close the original SparkContext with sc.close() before executing the JavaStreamingContext, it runs perfectly with no errors or issues.
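
A minimal sketch of that workaround, reusing the conf, Hive DDL, and Kafka setup from the code above (the elided parts are placeholders): finish the batch work first, close its context, and only then create the streaming context.

JavaSparkContext sc = new JavaSparkContext(conf);
JavaHiveContext hiveSqlContext = new JavaHiveContext(sc);
// ... run the Hive DDL and queries here ...

// Release the first SparkContext before the streaming context is created
sc.close();

// The streaming context now owns the only SparkContext in the JVM
JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));
// ... build the Kafka DStream, then jssc.start() and jssc.awaitTermination() ...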

You can use a singleton object ContextManager which would handle which context to provide.

public class ContextManager {

    private static JavaSparkContext context;
    private static String currentType;

    private ContextManager() {}

    public static synchronized JavaSparkContext getContext(String type) {

        // Reuse the live context if it already matches the requested type
        if (type.equals(currentType) && context != null) {
            return context;
        }

        // Clean up the current context before switching types
        if (context != null) {
            context.stop();
        }

        // Initialize a fresh context for the requested type; a streaming
        // context can then be built on top of the returned JavaSparkContext
        SparkConf conf = new SparkConf()
                .setAppName(Properties.getString("SparkAppName"))
                .setMaster(Properties.getString("SparkMasterUrl"));
        context = new JavaSparkContext(conf);
        currentType = type;

        return context;
    }
}
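
A hypothetical usage of that manager (the type strings are illustrative): fetch the batch context for the Hive work, then ask the manager for a streaming-flavored context and wrap it in a JavaStreamingContext.

JavaSparkContext batchSc = ContextManager.getContext("batch");
JavaHiveContext hiveSqlContext = new JavaHiveContext(batchSc);
// ... run the Hive queries ...

JavaSparkContext streamingSc = ContextManager.getContext("streaming");
JavaStreamingContext jssc = new JavaStreamingContext(streamingSc, new Duration(5000));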

One caveat: in projects where you switch contexts quite rapidly, the overhead of repeatedly stopping and recreating them can be quite large.

You can access the SparkContext from your JavaStreamingContext, and use that reference when creating additional contexts.

SparkConf sparkConfig = new SparkConf().setAppName("foo");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConfig, Durations.seconds(30));
SQLContext sqlContext = new SQLContext(jssc.sparkContext());
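
Applied to the question's code, this would look roughly as follows (a sketch, assuming the Spark 1.x JavaHiveContext API from the question): create the streaming context first and derive the Hive context from its underlying JavaSparkContext, so only one SparkContext ever exists.

JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));
JavaHiveContext hiveSqlContext = new JavaHiveContext(jssc.sparkContext());
// The Hive queries and the Kafka DStream now share a single SparkContext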
