
How to filter data based on three columns in Scala

I am new to Scala and I want to iterate through three for loops over one data set and perform some analysis. For example, my data is like below:

Sample.csv

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9
1,100,1,NA,0,1,0,Friday,1,5
1,100,2,NA,0,1,0,Friday,1,5
1,101,0,NA,0,1,0,Friday,1,5
1,101,1,NA,0,1,0,Friday,1,5
1,101,2,NA,0,1,0,Friday,1,5
1,102,0,NA,0,1,0,Friday,1,5
1,102,1,NA,0,1,0,Friday,1,5
1,102,2,NA,0,1,0,Friday,1,5

So now I read the data like below:

val data = sc.textFile("C:/users/ricky/Data.csv")

Now I need to implement a filter on the first three columns in Scala to get subsets of the whole data and do some analysis. The first three columns are the ones to be filtered: I have one value for the 1st column (1), 3 values for the second column (100, 101, 102), and 3 values for the third column (0, 1, 2). So now I need to run the filter for each combination to get a subset of the whole data. Is it good to use loops like below?

for {
  i <- Seq(1)
  j <- 100 to 102
  k <- 0 to 2
}

which should give subset data like

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9

where i=1, j=100, and k=0

and up to

1,102,2,NA,0,1,0,Friday,1,5

where i=1, j=102, and k=2

How can I run this over the data (which I read from CSV) in Scala?

After you read it from the csv text file, you can use filter to keep just the data you want:

val tempData = data.map(line => line.split(","))
tempData.filter(array => array(0) == "1" && array(1) == "100" && array(2) == "0").foreach(x => println(x.mkString(",")))

This will give you the result:

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9

You can do the same for the rest of the cases, for example by looping over the combinations as in the sketch below.
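
If you really want one subset per combination, a minimal sketch (assuming the same tempData as above and the value ranges from the question) is to drive the filter from the for comprehension:

for {
  i <- Seq(1)
  j <- 100 to 102
  k <- 0 to 2
} {
  // build the subset for this (i, j, k) combination
  val subset = tempData.filter(array =>
    array(0) == i.toString && array(1) == j.toString && array(2) == k.toString)
  // do some analysis on the subset; here we just print the matching rows
  println(s"i=$i, j=$j, k=$k:")
  subset.foreach(x => println(x.mkString(",")))
}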

DataFrame API

You can use the dataframe api for simplicity; it is more optimized than rdd and offers much more. The first step would be to read the csv into a dataframe:

val df = sqlContext.read.format("com.databricks.spark.csv").load("path to csv file")
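
If you are on Spark 2.x or later, the same read can be done through the SparkSession entry point instead of the databricks csv package (a sketch, assuming a SparkSession named spark is in scope):

val df = spark.read.csv("path to csv file")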

you will have

+---+---+---+---+---+---+---+---------+---+---+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7      |_c8|_c9|
+---+---+---+---+---+---+---+---------+---+---+
|1  |100|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |100|0  |NA |0  |1  |0  |Wednesday|1  |9  |
|1  |100|1  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |100|2  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |101|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |101|1  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |101|2  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |102|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |102|1  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |102|2  |NA |0  |1  |0  |Friday   |1  |5  |
+---+---+---+---+---+---+---+---------+---+---+

Then you can use the filter api just as with the rdd:

import sqlContext.implicits._
val df1 = df.filter($"_c0" === "1" && $"_c1" === "100" && $"_c2" === "0")

you should have

+---+---+---+---+---+---+---+---------+---+---+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7      |_c8|_c9|
+---+---+---+---+---+---+---+---------+---+---+
|1  |100|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |100|0  |NA |0  |1  |0  |Wednesday|1  |9  |
+---+---+---+---+---+---+---+---------+---+---+

You can even define schema to have the column names as you want.
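
For instance, a minimal sketch of a user-defined schema (the column names such as id, group and level are assumptions for illustration, not from your original data):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("group", IntegerType),
  StructField("level", IntegerType),
  StructField("c3", StringType),
  StructField("c4", IntegerType),
  StructField("c5", IntegerType),
  StructField("c6", IntegerType),
  StructField("day", StringType),
  StructField("c8", IntegerType),
  StructField("c9", IntegerType)
))

val named = sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(schema)
  .load("path to csv file")

// the filter then reads more naturally
val subset = named.filter($"id" === 1 && $"group" === 100 && $"level" === 0)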

Edited

Answering your comment below: it all depends on what you output.

scala> val temp = tempData.filter(array => array(0) == "1" && array(1).toInt == 100 && array(2).toInt == 0).map(x => x.mkString(","))
temp: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at map at <console>:28

scala> tempData.filter(array => array(0) == "1" && array(1).toInt == 100 && array(2).toInt == 0)
res9: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[13] at filter at <console>:29
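
To actually look at the contents of either result, you can collect it to the driver and print it (a small sketch on top of the snippets above):

// temp is an RDD[String], so each element prints as a csv line
temp.collect().foreach(println)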

I hope it's clear.
