Filter RDD's csv with JSON field using Spark/Scala
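I'm working with Spark/Scala and need to filter an RDD by a specific field inside one of its columns, in this case user.
I want to get back an RDD containing only the users ["Joe","Plank","Willy"], but I can't figure out how.
This is my RDD: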
2020-03-01T00:00:05Z my.local5.url {"request_method":"GET","request_length":281,"user":"Joe"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Willy"}
2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Tracy"}
2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Roger"}
Expected output:
2020-03-01T00:00:05Z my.local5.url {"request_method":"GET","request_length":281,"user":"Joe"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Willy"}
2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
I have loaded the RDD with Spark roughly like this (pseudocode):
val sparkConf = new SparkConf().setAppName("MyApp")
master.foreach(sparkConf.setMaster)
val sc = new SparkContext(sparkConf)
val rdd = sc.textFile(inputDir)
rdd.filter(_.contains("\"user\":\"THE_ARRAY_OF_NAMES_"))
This would be easier with a DataFrame.
Using the from_json function, you can turn the JSON column into multiple columns:
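import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._ // enables the $"..." column syntax

// Schema of the JSON payload
val jsonSchema = StructType(Array(
  StructField("request_method", StringType, true),
  StructField("request_length", IntegerType, true),
  StructField("user", StringType, true)
))
val myDf = spark.read.option("header", "true").csv(path)
// column_name is a placeholder for the column that holds the JSON string
val formatedDf = myDf.withColumn("formated_json", from_json($"column_name", jsonSchema))
  .select($"formated_json.*")
  .where($"user".isin("Joe", "Plank", "Willy"))
formatedDf.show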
However, if you would prefer an RDD approach, let me know.
Edit with an RDD version (keep in mind this is only one of many ways to do it):
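The sample lines in the question are space-delimited and have no header, so the column_name placeholder above has to come from somewhere. The following is a minimal sketch of one way to get it, assuming Spark 3.0 or later (for the three-argument split) and a DataFrame with a single string column named value, which is what spark.read.text produces. The object name FilterUsersDf, the helper filterUsers, the output column names, and the local[*] master are all hypothetical choices for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, split}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object FilterUsersDf {
  // Same schema as in the answer above
  val jsonSchema: StructType = StructType(Array(
    StructField("request_method", StringType, true),
    StructField("request_length", IntegerType, true),
    StructField("user", StringType, true)
  ))

  // `raw` is assumed to have one string column named "value" (one log line per row)
  def filterUsers(raw: DataFrame, users: Seq[String]): DataFrame = {
    // Split on the first two spaces only, so spaces inside the JSON would survive
    val parts = split(col("value"), " ", 3)
    raw
      .select(
        parts.getItem(0).as("timestamp"),
        parts.getItem(1).as("host"),
        parts.getItem(2).as("json"))
      .withColumn("parsed", from_json(col("json"), jsonSchema))
      .where(col("parsed.user").isin(users: _*))
      .drop("parsed")
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for trying this out on one machine
    val spark = SparkSession.builder().appName("MyApp").master("local[*]").getOrCreate()
    import spark.implicits._

    // Stand-in for spark.read.text(inputDir)
    val raw = Seq(
      """2020-03-01T00:00:05Z my.local5.url {"request_method":"GET","request_length":281,"user":"Joe"}""",
      """2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Tracy"}"""
    ).toDF("value")

    filterUsers(raw, Seq("Joe", "Plank", "Willy")).show(false)
    spark.stop()
  }
}
```

Unlike the select($"formated_json.*") variant, this sketch keeps the timestamp, host, and original JSON string, which matches the expected output in the question more closely.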
// Define a regex pattern that captures the user name
val Pattern = """(?i)"user":"([a-zA-Z]+)"""".r
// Define a Set with the users you want to keep
val userSet = Set("Joe", "Plank", "Willy")
// Keep only the rows whose user is in the set
val filteredRdd = rdd.filter { x =>
  // Extract the user with the pattern declared above (None if there is no match)
  val user = Pattern.findFirstMatchIn(x).map(_.group(1))
  // A Set is also a predicate: this returns true when the extracted user is in
  // userSet, and the row is kept only in that case
  userSet(user.getOrElse(""))
}
To check that the result is correct, you can use:
filteredRdd.collect().foreach(println)
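The pattern above only accepts ASCII letters in the user name. As a hedged variation, the sketch below pulls the predicate out into a plain Scala object so it can be tried without a cluster. It accepts any characters up to the closing quote and tolerates whitespace around the colon, on the assumption that user names never contain escaped quotes. The names UserFilter, extractUser, and keep are hypothetical.

```scala
object UserFilter {
  // Captures everything between the quotes after "user":
  // (assumes the value contains no escaped quote characters)
  private val UserPattern = """"user"\s*:\s*"([^"]*)"""".r

  val wanted: Set[String] = Set("Joe", "Plank", "Willy")

  // Returns the user name if the line contains a "user" field
  def extractUser(line: String): Option[String] =
    UserPattern.findFirstMatchIn(line).map(_.group(1))

  // True only when a user was found and it is one of the wanted names
  def keep(line: String): Boolean = extractUser(line).exists(wanted)

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      """2020-03-01T00:00:05Z my.local5.url {"request_method":"GET","request_length":281,"user":"Joe"}""",
      """2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Tracy"}""",
      """2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Plank"}"""
    )
    // Prints the lines whose user is in `wanted`
    lines.filter(keep).foreach(println)
  }
}
```

With an RDD, the same predicate could be applied as rdd.filter(line => UserFilter.keep(line)). A regex is still a shortcut here: if the payload can contain nested objects or escaped quotes, a real JSON parser would be the safer choice.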