
Filter RDD's csv with JSON field using Spark/Scala

I'm studying Spark/Scala and I need to filter an RDD by a specific field inside one column, in this case user.

I want to return an RDD containing only the rows for the users ["Joe","Plank","Willy"], but I can't figure out how.

This is my RDD:

2020-03-01T00:00:05Z    my.local5.url   {"request_method":"GET","request_length":281,"user":"Joe"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Willy"}
2020-03-01T00:00:05Z    my.local6.url   {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Tracy"}
2020-03-01T00:00:05Z    my.local6.url   {"request_method":"GET","request_length":281,"user":"Roger"}

Expected output:

2020-03-01T00:00:05Z    my.local5.url   {"request_method":"GET","request_length":281,"user":"Joe"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Willy"}
2020-03-01T00:00:05Z    my.local6.url   {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Plank"}

I've extracted the RDD using Spark with something like this (pseudocode):

import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(sparkConf)

//inputDir is the path to the input files
val rdd = sc.textFile(inputDir)

It's easier to do this with DataFrames.

Using the from_json function, you can parse that JSON column into a struct column whose fields can then be selected like ordinary columns:

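import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
//Fields below are taken from the sample rows in the question
val jsonSchema = StructType(Array(StructField("request_method", StringType),
  StructField("request_length", LongType), StructField("user", StringType)))

val myDf = spark.read.option("header", "true").csv(path)
val formatedDf = myDf.withColumn("formated_json", from_json($"column_name", jsonSchema))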
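
The snippet above parses the JSON but stops before the filter the question asks for. A minimal sketch of that last step, assuming a local SparkSession and a small in-memory DataFrame built from two of the question's rows; the column names `ts`, `host` and `json` and the `wanted` list are illustrative, not part of the original answer:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists
val spark = SparkSession.builder().appName("MyApp").master("local[*]").getOrCreate()
import spark.implicits._

// Two rows from the question, with illustrative column names
val df = Seq(
  ("2020-03-01T00:00:05Z", "my.local5.url", """{"request_method":"GET","request_length":281,"user":"Joe"}"""),
  ("2020-03-01T00:00:05Z", "my.local2.url", """{"request_method":"GET","request_length":281,"user":"Tracy"}""")
).toDF("ts", "host", "json")

// Schema inferred from the sample rows
val jsonSchema = StructType(Array(
  StructField("request_method", StringType),
  StructField("request_length", LongType),
  StructField("user", StringType)))

val wanted = Seq("Joe", "Plank", "Willy")

// Parse the JSON string into a struct, then keep rows whose user is in the list
val filtered = df
  .withColumn("formated_json", from_json(col("json"), jsonSchema))
  .filter(col("formated_json.user").isin(wanted: _*))

filtered.show(false)
```

When reading the question's file directly, note that the sample looks tab-separated with no header row, so the reader would likely need `.option("sep", "\t")` without the header option, in which case the columns get the default names `_c0`, `_c1`, `_c2`.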


If you would rather have an RDD approach, please let me know.

Edit, with an RDD version (remember this is only one of many approaches):

//Define a regex pattern
val Pattern = """(?i)"user":"([a-zA-Z]+)"""".r
//Define a Set with your filtered values
val userSet = Set("Joe","Plank","Willy")
//Filter only the values you want
val filteredRdd = rdd.filter( x => {
    //Extract the user using the pattern we just declared
    val user = for(m <- Pattern.findFirstMatchIn(x)) yield m.group(1)
    //Keep the row only if a user was found and it is one of the values in the set
    user.exists(userSet.contains)
})

To see if the result is right, you can use:
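
One possible check, as a minimal sketch: on an RDD, `filteredRdd.collect().foreach(println)` prints every kept row on the driver, which is only sensible for small results. The same predicate can also be tried without a cluster on a plain Scala `Seq`. The `keep` helper and the `sample` values below are illustrative names, not part of the original answer:

```scala
// Same regex and user set as in the RDD version above
val Pattern = """(?i)"user":"([a-zA-Z]+)"""".r
val userSet = Set("Joe", "Plank", "Willy")

// Hypothetical helper: true when the line's user is one of the wanted users
def keep(line: String): Boolean =
  Pattern.findFirstMatchIn(line).map(_.group(1)).exists(userSet.contains)

// Two sample lines taken from the question (tab-separated)
val sample = Seq(
  "2020-03-01T00:00:05Z\tmy.local5.url\t{\"request_method\":\"GET\",\"request_length\":281,\"user\":\"Joe\"}",
  "2020-03-01T00:00:05Z\tmy.local2.url\t{\"request_method\":\"GET\",\"request_length\":281,\"user\":\"Tracy\"}"
)

// Only the line for Joe passes the predicate
val kept = sample.filter(keep)
kept.foreach(println)

// On the RDD itself, the equivalent check would be:
// filteredRdd.collect().foreach(println)
```

Note that the set lookup is case-sensitive even though the regex is not, so a row with "joe" would be matched by the pattern but still dropped.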


