How to use regex in Spark Scala to convert an RDD to a DataFrame after reading an unstructured text file?
package sparkscala2.test

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object example1 {

  case class click(rowtime: String, key: String, ip: String, userid: String,
    remote_user: String, time: String, _time: String, request: String,
    status: String, bytes: String, referrer: String, agent: String)

  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\hadoop\\")
    val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._

    val rdd = spark.sparkContext.textFile("file:///C://Users//User//Desktop//test1.txt")
    val clean_rdd = rdd.map(_.replace("value: {", "")).map(_.replace("}", "")).map(_.replace("\"", ""))
    val schema_rdd = clean_rdd.map(_.split(",")).map(x => click(
      x(0).split(":")(1), x(1).split(":")(1), x(2).split(":")(1), x(3).split(":")(1),
      x(4).split(":")(1), x(5).split(":")(1), x(6).split(":")(1), x(7).split(":")(1),
      x(8).split(":")(1), x(9).split(":")(1), x(10).split(":")(1), x(11).split(":")(1)))
    val final_df = schema_rdd.toDF()
    final_df.show(false)
  }
}
Input file: test1.txt
rowtime: 2020/06/11 10:38:42.449 Z, key: 222.90.225.227, value: {"ip":"222.90.225.227","userid":12,"remote_user":"-","time":"1","_time":1,"request":"GET /images/logo-small.png HTTP/1.1","status":"302","bytes":"1289","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}
rowtime: 2020/06/11 10:38:42.528 Z, key: 111.245.174.248, value: {"ip":"111.245.174.248","userid":30,"remote_user":"-","time":"11","_time":11,"request":"GET /site/login.html HTTP/1.1","status":"302","bytes":"14096","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}
rowtime: 2020/06/11 10:38:42.705 Z, key: 122.152.45.245, value: {"ip":"122.152.45.245","userid":11,"remote_user":"-","time":"21","_time":21,"request":"GET /images/logo-small.png HTTP/1.1","status":"407","bytes":"4196","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}
Output:
+--------------+----------------+---------------+------+-----------+----+-----+-----------------------------------+------+-----+--------+-------------------------------------------------------------------+
|rowtime |key |ip |userid|remote_user|time|_time|request |status|bytes|referrer|agent |
+--------------+----------------+---------------+------+-----------+----+-----+-----------------------------------+------+-----+--------+-------------------------------------------------------------------+
| 2020/06/11 10| 222.90.225.227 |222.90.225.227 |12 |- |1 |1 |GET /images/logo-small.png HTTP/1.1|302 |1289 |- |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML|
| 2020/06/11 10| 111.245.174.248|111.245.174.248|30 |- |11 |11 |GET /site/login.html HTTP/1.1 |302 |14096|- |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML|
| 2020/06/11 10| 122.152.45.245 |122.152.45.245 |11 |- |21 |21 |GET /images/logo-small.png HTTP/1.1|407 |4196 |- |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML|
+--------------+----------------+---------------+------+-----------+----+-----+-----------------------------------+------+-----+--------+-------------------------------------------------------------------+
I tried the code above, but in the output the rowtime and agent columns are incomplete because the data itself contains colons and commas. The rowtime value contains colons, so the rest of it is lost when I split on ":" to separate keys from values; the agent value contains commas, so the rest of it is lost because I split on "," at the start.
Is there any way to apply a regex function while mapping to the case class schema? Or is there another approach?
Instead of using a case class with an RDD, you can read the file as a Dataset/DataFrame and then transform it to get the desired result.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{regexp_extract, from_json, schema_of_json}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._

// Assuming a fixed structure; adapt this regex as required
val regex = "([a-z-A-Z]+:\\s+)([\\s+\\d:\\./]+Z)([,\\s+a-z-A-Z]+:\\s+)([\\d\\.]+)([,\\s+]+[a-z-A-Z]+:)(.*)"
/*
([a-z-A-Z]+:\\s+)       --> matches "rowtime: "
([\\s+\\d:\\./]+Z)      --> matches the rowtime value, e.g. 2020/06/11 10:38:42.449 Z
([,\\s+a-z-A-Z]+:\\s+)  --> matches ", key: "
([\\d\\.]+)             --> matches the key value, e.g. 222.90.225.227
([,\\s+]+[a-z-A-Z]+:)   --> matches ", value:"
(.*)                    --> matches the content of the value field, which is JSON
*/

// Read the file as a dataframe and extract each column's content via the regex group IDs
var df = spark.read.textFile("sample.txt")
  .select(regexp_extract('value, regex, 2).as("rowtime"),
    regexp_extract('value, regex, 4).as("key"),
    regexp_extract('value, regex, 6).as("value"))

// Since value is JSON, use from_json to create a struct field
df = df.withColumn("value", from_json('value, schema_of_json(df.select("value").first().getString(0))))

// Select all columns, including the nested columns of the value column
df.select("rowtime", "key", "value.*").show(false)
+-------------------------+---------------+-----+-------------------------------------------------------------------------------------------------------------------+-----+---------------+--------+-----------+-----------------------------------+------+----+------+
|rowtime |key |_time|agent |bytes|ip |referrer|remote_user|request |status|time|userid|
+-------------------------+---------------+-----+-------------------------------------------------------------------------------------------------------------------+-----+---------------+--------+-----------+-----------------------------------+------+----+------+
|2020/06/11 10:38:42.449 Z|222.90.225.227 |1 |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36|1289 |222.90.225.227 |- |- |GET /images/logo-small.png HTTP/1.1|302 |1 |12 |
|2020/06/11 10:38:42.528 Z|111.245.174.248|11 |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36|14096|111.245.174.248|- |- |GET /site/login.html HTTP/1.1 |302 |11 |30 |
|2020/06/11 10:38:42.705 Z|122.152.45.245 |21 |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36|4196 |122.152.45.245 |- |- |GET /images/logo-small.png HTTP/1.1|407 |21 |11 |
+-------------------------+---------------+-----+-------------------------------------------------------------------------------------------------------------------+-----+---------------+--------+-----------+-----------------------------------+------+----+------+
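The grouping logic can be sanity-checked outside Spark with plain Scala's `Regex`; groups 2, 4 and 6 carry the three payloads. A minimal sketch, using one sample line from the file (the JSON is truncated for readability, and the object name `RegexCheck` is just an illustration):

```scala
object RegexCheck extends App {
  // Same pattern as in the answer above; groups 2, 4 and 6 hold the payloads.
  val regex = ("([a-z-A-Z]+:\\s+)([\\s+\\d:\\./]+Z)([,\\s+a-z-A-Z]+:\\s+)" +
    "([\\d\\.]+)([,\\s+]+[a-z-A-Z]+:)(.*)").r

  val line = """rowtime: 2020/06/11 10:38:42.449 Z, key: 222.90.225.227, value: {"ip":"222.90.225.227","userid":12}"""

  regex.findFirstMatchIn(line) match {
    case Some(m) =>
      println(s"rowtime = ${m.group(2)}")      // 2020/06/11 10:38:42.449 Z
      println(s"key     = ${m.group(4)}")      // 222.90.225.227
      println(s"value   = ${m.group(6).trim}") // the raw JSON payload
    case None => println("no match")
  }
}
```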
Your data almost looks like JSON, but some fields are missing double quotes and the whole record lacks an outer pair of curly braces.
Try the following, a more spark-sql oriented approach.
Simulate the dataframe with hard-coded strings; in practice you would read these from the file.
val df = Seq("""rowtime: 2020/06/11 10:38:42.528 Z, key: 111.245.174.248, value: {"ip":"111.245.174.248","userid":30,"remote_user":"-","time":"11","_time":11,"request":"GET /site/login.html HTTP/1.1","status":"302","bytes":"14096","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"} """,
"""rowtime: 2020/06/11 10:38:42.528 Z, key: 111.245.174.248, value: {"ip":"111.245.174.248","userid":30,"remote_user":"-","time":"11","_time":11,"request":"GET /site/login.html HTTP/1.1","status":"302","bytes":"14096","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}""",
"""rowtime: 2020/06/11 10:38:42.705 Z, key: 122.152.45.245, value: {"ip":"122.152.45.245","userid":11,"remote_user":"-","time":"21","_time":21,"request":"GET /images/logo-small.png HTTP/1.1","status":"407","bytes":"4196","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}"""
).toDF("x")
Create a temporary view:
df.show(false)
df.createOrReplaceTempView("df")
Input data:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|x |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|rowtime: 2020/06/11 10:38:42.528 Z, key: 111.245.174.248, value: {"ip":"111.245.174.248","userid":30,"remote_user":"-","time":"11","_time":11,"request":"GET /site/login.html HTTP/1.1","status":"302","bytes":"14096","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"} |
|rowtime: 2020/06/11 10:38:42.528 Z, key: 111.245.174.248, value: {"ip":"111.245.174.248","userid":30,"remote_user":"-","time":"11","_time":11,"request":"GET /site/login.html HTTP/1.1","status":"302","bytes":"14096","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"} |
|rowtime: 2020/06/11 10:38:42.705 Z, key: 122.152.45.245, value: {"ip":"122.152.45.245","userid":11,"remote_user":"-","time":"21","_time":21,"request":"GET /images/logo-small.png HTTP/1.1","status":"407","bytes":"4196","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Now transform the single column "x" with regex functions, adding the missing double quotes and the curly braces so that each row becomes a valid JSON literal:
val df1 = spark.sql("""
with t1 ( select x from df ),
t2 ( select regexp_replace(x,"(rowtime|key|value):","\"$1\":") x from t1 ),
t3 ( select regexp_replace(x,"(\"rowtime\":)\\s+([^,]+),","$1 \"$2\",") x from t2 ),
t4 ( select regexp_replace(x,"(\"key\":)\\s+([^,]+),","$1 \"$2\",") x from t3 )
select '{'||x||'}' y from t4
""")
df1.printSchema()
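The three regexp_replace steps above can be sketched outside Spark with plain Scala's `replaceAll`, to see how one raw line turns into a valid JSON literal. A minimal sketch on one hard-coded line (truncated for readability; the object name `JsonRepairCheck` is just an illustration):

```scala
object JsonRepairCheck extends App {
  // One raw line, with the JSON payload truncated for readability.
  val raw = """rowtime: 2020/06/11 10:38:42.528 Z, key: 111.245.174.248, value: {"ip":"111.245.174.248","userid":30}"""

  val step1 = raw.replaceAll("(rowtime|key|value):", "\"$1\":")            // quote the bare keys
  val step2 = step1.replaceAll("(\"rowtime\":)\\s+([^,]+),", "$1 \"$2\",") // quote the rowtime value
  val step3 = step2.replaceAll("(\"key\":)\\s+([^,]+),", "$1 \"$2\",")     // quote the key value
  val json  = "{" + step3 + "}"                                            // wrap in curly braces

  println(json)
  // {"rowtime": "2020/06/11 10:38:42.528 Z", "key": "111.245.174.248", "value": {"ip":"111.245.174.248","userid":30}}
}
```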
Now every row of column "y" is a valid JSON literal. Convert this single-column dataframe of JSON strings into a structured dataframe as follows:
import spark.implicits._
val df2 = spark.read.json(df1.as[String])
df2.printSchema
root
|-- key: string (nullable = true)
|-- rowtime: string (nullable = true)
|-- value: struct (nullable = true)
| |-- _time: long (nullable = true)
| |-- agent: string (nullable = true)
| |-- bytes: string (nullable = true)
| |-- ip: string (nullable = true)
| |-- referrer: string (nullable = true)
| |-- remote_user: string (nullable = true)
| |-- request: string (nullable = true)
| |-- status: string (nullable = true)
| |-- time: string (nullable = true)
| |-- userid: long (nullable = true)
Create a view on top of it:
df2.createOrReplaceTempView("df2")
Now use spark-sql to extract the required elements for the output:
spark.sql("""
select rowtime , key, value.ip ip, value.userid userid, value.remote_user remote_user,
value.time time, value._time _time, value.request request, value.status status, value.bytes bytes,
value.referrer referrer, value.agent agent
from df2
""").show(false)
Output:
+-------------------------+---------------+---------------+------+-----------+----+-----+-----------------------------------+------+-----+--------+-------------------------------------------------------------------------------------------------------------------+
|rowtime |key |ip |userid|remote_user|time|_time|request |status|bytes|referrer|agent |
+-------------------------+---------------+---------------+------+-----------+----+-----+-----------------------------------+------+-----+--------+-------------------------------------------------------------------------------------------------------------------+
|2020/06/11 10:38:42.528 Z|111.245.174.248|111.245.174.248|30 |- |11 |11 |GET /site/login.html HTTP/1.1 |302 |14096|- |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36|
|2020/06/11 10:38:42.528 Z|111.245.174.248|111.245.174.248|30 |- |11 |11 |GET /site/login.html HTTP/1.1 |302 |14096|- |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36|
|2020/06/11 10:38:42.705 Z|122.152.45.245 |122.152.45.245 |11 |- |21 |21 |GET /images/logo-small.png HTTP/1.1|407 |4196 |- |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36|
+-------------------------+---------------+---------------+------+-----------+----+-----+-----------------------------------+------+-----+--------+-------------------------------------------------------------------------------------------------------------------+