
Using Apache Spark for temperature prediction

I am new to Spark and have just started doing some serious work with it.
We are building a platform that receives temperature readings from stations, each with a timestamp. The data is posted to RabbitMQ as CSV, e.g.:

WD1,12.3,15-10-12T12:23:45
WD2,12.4,15-10-12T12:24:45
WD1,12.3,15-10-12T12:25:45
WD1,22.3,15-10-12T12:26:45

We dump the data into Cassandra and want to use Spark to build a model from it. The aim of the model is to find sharp temperature rises that happen within a short time window. For example, in the data above there is a 10-degree rise in temperature within 1 minute. I was thinking of using linear regression to build the model. However, Spark's linear regression model seems to accept only double values, and after reading the documentation I understand that the equation for finding the weights is more of the form

y = a1x1+a2x2+a3x3

than

y = mx+c

So Spark can give me the weights and the intercept, but I am not sure I can use this model. Just to satisfy my curiosity, I tried to build the model from this data, but all of the predictions were horrendous, and I suspect the data was as well. I tried to build a matrix of temperature vs. timestamp, and the predictions were badly off.

My questions are the following

  1. Is the way I am building the model completely wrong? If so, how do I fix it?
  2. If not a linear regression model, is there another model or mechanism that can indicate this sharp rise?

My sample code:

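// Imports this snippet relies on (Spark 1.x Java API)
import java.text.SimpleDateFormat;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.ml.feature.StandardScaler;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.sql.DataFrame;

JavaRDD<LabeledPoint> parsedData = cassandraRowsRDD.map(new Function<String, LabeledPoint>() {
            public LabeledPoint call(String line) throws Exception {
                // Each row is "station,temperature,timestamp", e.g. "WD1,12.3,15-10-12T12:23:45"
                String[] parts = line.split(",");
                // parts[0] is the station id ("WD1"), which is not numeric; the label is the temperature in parts[1]
                double temperature = Double.parseDouble(parts[1]);
                // Assumption: the timestamp pattern is yy-MM-dd'T'HH:mm:ss; convert it to epoch seconds
                double timestamp = new SimpleDateFormat("yy-MM-dd'T'HH:mm:ss").parse(parts[2]).getTime() / 1000.0;
                System.out.println("Y = " + temperature + " :: TIMESTAMP = " + timestamp);
                // Label = temperature, single feature = timestamp
                return new LabeledPoint(temperature, Vectors.dense(timestamp));
            }
        });
        parsedData.cache();

        // Declared but not yet applied to the features
        StandardScaler scaler = new StandardScaler();
        DataFrame dataFrame = sqlContext.createDataFrame(parsedData, LabeledPoint.class);
        System.out.println(dataFrame.count());

        dataFrame.printSchema();

        LinearRegression lr = new LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8);

        // Fit the model
        LinearRegressionModel lrModel = lr.fit(dataFrame);
        System.out.println("Weights: " + lrModel.weights() + " Intercept: " + lrModel.intercept());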

I'm not sure a linear regression model is the best choice for what you're trying to do. First, a model is typically used to make predictions. If temperature were your variable of interest and time your independent variable, you would use the data points where you do have measurements to predict the temperature at times for which you have none. Or, if you were trying to show that global average temperature is rising with time, fitting a linear model might be a way of doing that. Neither is what you're trying to do.

It sounds to me like you just want to crunch the data, not model it and make predictions. It seems you want to take the differences between all readings at a location that fall within 1 minute of one another, and be notified if any temperature difference is greater than 10 degrees.
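The pairwise comparison described above can be sketched in plain Java, with no Spark and no model involved. This is only a minimal sketch under stated assumptions: the names `SharpRiseDetector` and `findSharpRises` are made up for illustration, the timestamp pattern `yy-MM-dd'T'HH:mm:ss` is a guess based on the sample rows in the question, and readings are compared only within the same station.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SharpRiseDetector {
    // Assumption: "15-10-12T12:23:45" means yy-MM-dd'T'HH:mm:ss.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yy-MM-dd'T'HH:mm:ss");

    // One CSV row: station id, temperature, timestamp.
    static final class Reading {
        final String station;
        final double temperature;
        final LocalDateTime time;

        Reading(String csvLine) {
            String[] parts = csvLine.split(",");
            station = parts[0];
            temperature = Double.parseDouble(parts[1]);
            time = LocalDateTime.parse(parts[2], FORMAT);
        }
    }

    // Returns one message per pair of readings from the same station that are
    // at most `window` apart and whose temperature rose by at least `threshold`.
    static List<String> findSharpRises(List<String> csvLines, double threshold, Duration window) {
        Map<String, List<Reading>> byStation = new HashMap<>();
        for (String line : csvLines) {
            Reading r = new Reading(line);
            byStation.computeIfAbsent(r.station, k -> new ArrayList<>()).add(r);
        }

        List<String> alerts = new ArrayList<>();
        for (List<Reading> readings : byStation.values()) {
            // Sort by time so the inner loop can stop once it leaves the window.
            readings.sort((a, b) -> a.time.compareTo(b.time));
            for (int i = 0; i < readings.size(); i++) {
                Reading earlier = readings.get(i);
                for (int j = i + 1; j < readings.size(); j++) {
                    Reading later = readings.get(j);
                    if (Duration.between(earlier.time, later.time).compareTo(window) > 0) {
                        break;
                    }
                    double rise = later.temperature - earlier.temperature;
                    if (rise >= threshold) {
                        alerts.add(String.format("%s: +%.1f degrees between %s and %s",
                                earlier.station, rise, earlier.time, later.time));
                    }
                }
            }
        }
        return alerts;
    }

    public static void main(String[] args) {
        // Sample rows taken from the question.
        List<String> lines = Arrays.asList(
                "WD1,12.3,15-10-12T12:23:45",
                "WD2,12.4,15-10-12T12:24:45",
                "WD1,12.3,15-10-12T12:25:45",
                "WD1,22.3,15-10-12T12:26:45");
        for (String alert : findSharpRises(lines, 10.0, Duration.ofMinutes(1))) {
            System.out.println(alert);
        }
    }
}
```

On a larger data set the same per-station logic could presumably run inside a Spark job after grouping rows by station id, but that step is not shown here.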

In that case, the devil is in the details. Are you only interested in changes of 10 degrees at the exact same station, or can it be any sensor within the same region? In either case, this is more of a data processing problem than a modeling one. If you want to, for instance, collect data all day and then run a script that analyzes it tomorrow, then Spark may be a good candidate. If, on the other hand, you want the system to monitor the data continually and flag you in real time, Spark is probably not the best choice. In that case you might want to look at Apache Storm. I'm not an expert in Storm, but I know its approximate use case is handling streaming, distributed data. Good luck!
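For the real-time case, the per-event check itself does not depend on the streaming framework. The sketch below is plain Java, not Storm or Spark code, and every name in it (`StreamingRiseMonitor`, `onReading`) is hypothetical. It assumes timestamps have already been converted to epoch seconds and that readings for a given station arrive in time order.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class StreamingRiseMonitor {
    // A single reading kept in the per-station window.
    private static final class Sample {
        final long epochSeconds;
        final double temperature;

        Sample(long epochSeconds, double temperature) {
            this.epochSeconds = epochSeconds;
            this.temperature = temperature;
        }
    }

    private final double threshold;
    private final long windowSeconds;
    // Recent readings per station, oldest first.
    private final Map<String, Deque<Sample>> recent = new HashMap<>();

    public StreamingRiseMonitor(double threshold, long windowSeconds) {
        this.threshold = threshold;
        this.windowSeconds = windowSeconds;
    }

    // Feeds one reading and returns true if it is at least `threshold` degrees
    // above some earlier reading from the same station inside the window.
    public boolean onReading(String station, long epochSeconds, double temperature) {
        Deque<Sample> window = recent.computeIfAbsent(station, k -> new ArrayDeque<>());

        // Drop readings that are older than the window.
        while (!window.isEmpty()
                && epochSeconds - window.peekFirst().epochSeconds > windowSeconds) {
            window.pollFirst();
        }

        boolean alert = false;
        for (Sample earlier : window) {
            if (temperature - earlier.temperature >= threshold) {
                alert = true;
                break;
            }
        }

        window.addLast(new Sample(epochSeconds, temperature));
        return alert;
    }
}
```

A consumer reading from the message queue could call `onReading` once per incoming row and raise a notification whenever it returns `true`. Keeping the state in a per-station map is a design choice that only works on a single node; a distributed setup would need to route each station's readings to the same worker.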
