简体   繁体   中英

spark loop in matrix to run linear regression

I have a spark data frame dt as below. BAB is ID and I would like to run a linear regression with column AAB and AAD for every value of BAB.

数据框

This is how I run it. By filtering the whole dataframe for every BAB value, it gets really slow. Is there a way of looping the data like a 3-dimensional matrix and running a regression for every BAB? So that I need to go through BAB once only. It does not have to be spark mllib. Any other machine learning tool with scala coding is fine.

val arrColu = Array("AAB", "AAD");
val assFeat = new VectorAssembler().setInputCols(arrColu).setOutputCol("features");

val arrBAB=dt.select("BAB").collect.map(_ (0)).map(x => x.toString);

for (a<-0 to arrBAB.length-1){

val vecDF: DataFrame = assFeat.transform(dt.filter("BAB='"+arrBAB(a)+"'").select("AAB","AAD"));
val lr1=new LinearRegression();
val lr2=lr1.setFeaturesCol("features").setLabelCol("AAD").setFitIntercept(true).
  setMaxIter(10).setRegParam(.3).setElasticNetParam(.8);
val fitD1=lr2.fit(vecDF);
...
}

One way is converting the data frame into a list with tuples as element List((BAB1,AAB1,AAD1),(BAB2,AAB2,AAD2),...) , then slicing the list w.r.t each individual BAB and running regression on each slice.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM