I'm studying Spark/Scala and I need to filter an RDD by a specific field, in this case user.
I want to return an RDD containing only the users ["Joe","Plank","Willy"], but I can't figure out how.
This is my RDD:
2020-03-01T00:00:05Z my.local5.url {"request_method":"GET","request_length":281,"user":"Joe"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Willy"}
2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Tracy"}
2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Roger"}
Expected output:
2020-03-01T00:00:05Z my.local5.url {"request_method":"GET","request_length":281,"user":"Joe"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Willy"}
2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
I extracted the RDD using Spark with something like this (pseudocode):
val sparkConf = new SparkConf().setAppName("MyApp")
master.foreach(sparkConf.setMaster)
val sc = new SparkContext(sparkConf)
val rdd = sc.textFile(inputDir)
rdd.filter(_.contains("\"user\":\"THE_ARRAY_OF_NAMES_"))
It's easier to use DataFrames here.
With the from_json function you can turn that JSON column into multiple columns:
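import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._ // enables the $"..." column syntax; spark is your SparkSession
val jsonSchema = StructType(Array(
  StructField("request_method", StringType, true),
  StructField("request_length", IntegerType, true),
  StructField("user", StringType, true)
))
val myDf = spark.read.option("header", "true").csv(path)
// Replace column_name with the column that holds the JSON string
val formatedDf = myDf.withColumn("formated_json", from_json($"column_name", jsonSchema))
  .select($"formated_json.*")
  .where($"user".isin("Joe", "Plank", "Willy"))
formatedDf.show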
If you would rather have an RDD approach, please let me know.
Edit with an RDD version (this is just one of many possible approaches):
//Define a regex pattern
val Pattern = """(?i)"user":"([a-zA-Z]+)"""".r
//Define a Set with your filtered values
val userSet = Set("Joe","Plank","Willy")
//Filter only the values you want
val filteredRdd = rdd.filter( x => {
  //Extract the user using the pattern we just declared (None if the line has no match)
  val user = for(m <- Pattern.findFirstMatchIn(x)) yield m.group(1)
  //Keep the row only if the extracted user is in the set; a line without a match falls back to "" and is dropped
  userSet(user.getOrElse(""))
})
To see if the result is right, you can use:
filteredRdd.collect().foreach(println)
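If you want to check the filtering logic before running it on a cluster, the same regex and Set lookup can be tried on a plain Scala collection. This is only a minimal sketch: the names UserFilterCheck, keep and sample are made up for illustration, and the sample lines are copied from the question.

```scala
object UserFilterCheck {
  // Same pattern as in the RDD version: captures the value of the "user" field
  val Pattern = """(?i)"user":"([a-zA-Z]+)"""".r

  // Users to keep
  val userSet = Set("Joe", "Plank", "Willy")

  // True when the line has a "user" field whose value is in userSet.
  // Lines without a match give None, so exists returns false and they are dropped.
  def keep(line: String): Boolean =
    Pattern.findFirstMatchIn(line).map(_.group(1)).exists(userSet)

  // A few of the lines from the question, used as local test data
  val sample = Seq(
    """2020-03-01T00:00:05Z my.local5.url {"request_method":"GET","request_length":281,"user":"Joe"}""",
    """2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Tracy"}""",
    """2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Roger"}""",
    """2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Plank"}"""
  )

  def main(args: Array[String]): Unit =
    sample.filter(keep).foreach(println)
}
```

Once the predicate behaves as expected locally, the same function should be usable on the RDD as rdd.filter(UserFilterCheck.keep). Note that (?i) only makes the regex case-insensitive; the Set lookup is still case-sensitive, so "joe" would not be kept.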