How to use regex in Spark Scala to convert an RDD to a DataFrame after reading an unstructured text file?
package sparkscala2.test

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object example1 {

  case class click(rowtime: String, key: String, ip: String, userid: String,
    remote_user: String, time: String, _time: String, request: String,
    status: String, bytes: String, referrer: String, agent: String)

  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\hadoop\\")
    val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._

    val rdd = spark.sparkContext.textFile("file:///C://Users//User//Desktop//test1.txt")
    val clean_rdd = rdd.map(_.replace("value: {", "")).map(_.replace("}", "")).map(_.replace("\"", ""))
    val schema_rdd = clean_rdd.map(_.split(",")).map(x => click(
      x(0).split(":")(1), x(1).split(":")(1), x(2).split(":")(1), x(3).split(":")(1),
      x(4).split(":")(1), x(5).split(":")(1), x(6).split(":")(1), x(7).split(":")(1),
      x(8).split(":")(1), x(9).split(":")(1), x(10).split(":")(1), x(11).split(":")(1)))
    val final_df = schema_rdd.toDF()
    final_df.show(false)
  }
}
Input file: test1.txt
rowtime: 2020/06/11 10:38:42.449 Z, key: 222.90.225.227, value: {"ip":"222.90.225.227","userid":12,"remote_user":"-","time":"1","_time":1,"request":"GET /images/logo-small.png HTTP/1.1","status":"302","bytes":"1289","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}
rowtime: 2020/06/11 10:38:42.528 Z, key: 111.245.174.248, value: {"ip":"111.245.174.248","userid":30,"remote_user":"-","time":"11","_time":11,"request":"GET /site/login.html HTTP/1.1","status":"302","bytes":"14096","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}
rowtime: 2020/06/11 10:38:42.705 Z, key: 122.152.45.245, value: {"ip":"122.152.45.245","userid":11,"remote_user":"-","time":"21","_time":21,"request":"GET /images/logo-small.png HTTP/1.1","status":"407","bytes":"4196","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}
Output:
+--------------+----------------+---------------+------+-----------+----+-----+-----------------------------------+------+-----+--------+-------------------------------------------------------------------+
|rowtime |key |ip |userid|remote_user|time|_time|request |status|bytes|referrer|agent |
+--------------+----------------+---------------+------+-----------+----+-----+-----------------------------------+------+-----+--------+-------------------------------------------------------------------+
| 2020/06/11 10| 222.90.225.227 |222.90.225.227 |12 |- |1 |1 |GET /images/logo-small.png HTTP/1.1|302 |1289 |- |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML|
| 2020/06/11 10| 111.245.174.248|111.245.174.248|30 |- |11 |11 |GET /site/login.html HTTP/1.1 |302 |14096|- |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML|
| 2020/06/11 10| 122.152.45.245 |122.152.45.245 |11 |- |21 |21 |GET /images/logo-small.png HTTP/1.1|407 |4196 |- |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML|
+--------------+----------------+---------------+------+-----------+----+-----+-----------------------------------+------+-----+--------+-------------------------------------------------------------------+
I tried the code above, but in the output the rowtime and agent columns are incomplete because the data itself contains colons and commas. The rowtime value contains colons, so the rest of it is lost when I split on ":" to separate keys from values; the agent value contains commas, so the rest of it is lost because I split on "," at the start.
Is there any way to apply a regex function while mapping to the case class schema? Or is there another approach?
Instead of using a case class with an RDD, you can read the file as a Dataset/DataFrame and then transform it to get the desired result.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{regexp_extract, from_json, schema_of_json}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._

// Assuming a fixed structure; adapt this regex as required
val regex = "([a-z-A-Z]+:\\s+)([\\s+\\d:\\./]+Z)([,\\s+a-z-A-Z]+:\\s+)([\\d\\.]+)([,\\s+]+[a-z-A-Z]+:)(.*)"
/*
([a-z-A-Z]+:\\s+)       --> matches "rowtime: "
([\\s+\\d:\\./]+Z)      --> matches the rowtime value, e.g. 2020/06/11 10:38:42.449 Z
([,\\s+a-z-A-Z]+:\\s+)  --> matches ", key: "
([\\d\\.]+)             --> matches the key value, e.g. 222.90.225.227
([,\\s+]+[a-z-A-Z]+:)   --> matches ", value:"
(.*)                    --> matches the content of the value field, which is JSON
*/

// Read the file as a dataframe and extract each column's content via the regex group IDs
var df = spark.read.textFile("sample.txt")
  .select(regexp_extract('value, regex, 2).as("rowtime"),
    regexp_extract('value, regex, 4).as("key"),
    regexp_extract('value, regex, 6).as("value"))

// Since value is JSON, use from_json to create a struct field
df = df.withColumn("value", from_json('value, schema_of_json(df.select("value").first().getString(0))))

// Select all columns, including the nested columns of the value column
df.select("rowtime", "key", "value.*").show(false)
+-------------------------+---------------+-----+-------------------------------------------------------------------------------------------------------------------+-----+---------------+--------+-----------+-----------------------------------+------+----+------+
|rowtime |key |_time|agent |bytes|ip |referrer|remote_user|request |status|time|userid|
+-------------------------+---------------+-----+-------------------------------------------------------------------------------------------------------------------+-----+---------------+--------+-----------+-----------------------------------+------+----+------+
|2020/06/11 10:38:42.449 Z|222.90.225.227 |1 |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36|1289 |222.90.225.227 |- |- |GET /images/logo-small.png HTTP/1.1|302 |1 |12 |
|2020/06/11 10:38:42.528 Z|111.245.174.248|11 |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36|14096|111.245.174.248|- |- |GET /site/login.html HTTP/1.1 |302 |11 |30 |
|2020/06/11 10:38:42.705 Z|122.152.45.245 |21 |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36|4196 |122.152.45.245 |- |- |GET /images/logo-small.png HTTP/1.1|407 |21 |11 |
+-------------------------+---------------+-----+-------------------------------------------------------------------------------------------------------------------+-----+---------------+--------+-----------+-----------------------------------+------+----+------+
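The grouping logic can be sanity-checked outside Spark with plain Scala's `Regex`; groups 2, 4 and 6 carry the three payloads. A minimal sketch, using one sample line from the file (the JSON is truncated for readability, and the object name `RegexCheck` is just an illustration):

```scala
object RegexCheck extends App {
  // Same pattern as in the answer above; groups 2, 4 and 6 hold the payloads.
  val regex = ("([a-z-A-Z]+:\\s+)([\\s+\\d:\\./]+Z)([,\\s+a-z-A-Z]+:\\s+)" +
    "([\\d\\.]+)([,\\s+]+[a-z-A-Z]+:)(.*)").r

  val line = """rowtime: 2020/06/11 10:38:42.449 Z, key: 222.90.225.227, value: {"ip":"222.90.225.227","userid":12}"""

  regex.findFirstMatchIn(line) match {
    case Some(m) =>
      println(s"rowtime = ${m.group(2)}")      // 2020/06/11 10:38:42.449 Z
      println(s"key     = ${m.group(4)}")      // 222.90.225.227
      println(s"value   = ${m.group(6).trim}") // the raw JSON payload
    case None => println("no match")
  }
}
```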
Your data almost looks like JSON, but some fields are missing double quotes and the whole record lacks an outer pair of curly braces.
Try the following, a more spark-sql oriented approach.
Simulate the dataframe with hard-coded strings; in practice you would read these from the file.
val df = Seq("""rowtime: 2020/06/11 10:38:42.528 Z, key: 111.245.174.248, value: {"ip":"111.245.174.248","userid":30,"remote_user":"-","time":"11","_time":11,"request":"GET /site/login.html HTTP/1.1","status":"302","bytes":"14096","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"} """,
"""rowtime: 2020/06/11 10:38:42.528 Z, key: 111.245.174.248, value: {"ip":"111.245.174.248","userid":30,"remote_user":"-","time":"11","_time":11,"request":"GET /site/login.html HTTP/1.1","status":"302","bytes":"14096","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}""",
"""rowtime: 2020/06/11 10:38:42.705 Z, key: 122.152.45.245, value: {"ip":"122.152.45.245","userid":11,"remote_user":"-","time":"21","_time":21,"request":"GET /images/logo-small.png HTTP/1.1","status":"407","bytes":"4196","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}"""
).toDF("x")
Create a temporary view:
df.show(false)
df.createOrReplaceTempView("df")
Input data:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|x |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|rowtime: 2020/06/11 10:38:42.528 Z, key: 111.245.174.248, value: {"ip":"111.245.174.248","userid":30,"remote_user":"-","time":"11","_time":11,"request":"GET /site/login.html HTTP/1.1","status":"302","bytes":"14096","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"} |
|rowtime: 2020/06/11 10:38:42.528 Z, key: 111.245.174.248, value: {"ip":"111.245.174.248","userid":30,"remote_user":"-","time":"11","_time":11,"request":"GET /site/login.html HTTP/1.1","status":"302","bytes":"14096","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"} |
|rowtime: 2020/06/11 10:38:42.705 Z, key: 122.152.45.245, value: {"ip":"122.152.45.245","userid":11,"remote_user":"-","time":"21","_time":21,"request":"GET /images/logo-small.png HTTP/1.1","status":"407","bytes":"4196","referrer":"-","agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Now transform the single column "x" with regex functions, adding the missing double quotes and the curly braces so that each row becomes a valid JSON literal:
val df1 = spark.sql("""
with t1 ( select x from df ),
t2 ( select regexp_replace(x,"(rowtime|key|value):","\"$1\":") x from t1 ),
t3 ( select regexp_replace(x,"(\"rowtime\":)\\s+([^,]+),","$1 \"$2\",") x from t2 ),
t4 ( select regexp_replace(x,"(\"key\":)\\s+([^,]+),","$1 \"$2\",") x from t3 )
select '{'||x||'}' y from t4
""")
df1.printSchema()
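The three regexp_replace steps above can be sketched outside Spark with plain Scala's `replaceAll`, to see how one raw line turns into a valid JSON literal. A minimal sketch on one hard-coded line (truncated for readability; the object name `JsonRepairCheck` is just an illustration):

```scala
object JsonRepairCheck extends App {
  // One raw line, with the JSON payload truncated for readability.
  val raw = """rowtime: 2020/06/11 10:38:42.528 Z, key: 111.245.174.248, value: {"ip":"111.245.174.248","userid":30}"""

  val step1 = raw.replaceAll("(rowtime|key|value):", "\"$1\":")            // quote the bare keys
  val step2 = step1.replaceAll("(\"rowtime\":)\\s+([^,]+),", "$1 \"$2\",") // quote the rowtime value
  val step3 = step2.replaceAll("(\"key\":)\\s+([^,]+),", "$1 \"$2\",")     // quote the key value
  val json  = "{" + step3 + "}"                                            // wrap in curly braces

  println(json)
  // {"rowtime": "2020/06/11 10:38:42.528 Z", "key": "111.245.174.248", "value": {"ip":"111.245.174.248","userid":30}}
}
```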
Now every row of column "y" is a valid JSON literal. Convert this single-column dataframe of JSON strings into a structured dataframe as follows:
import spark.implicits._
val df2 = spark.read.json(df1.as[String])
df2.printSchema
root
|-- key: string (nullable = true)
|-- rowtime: string (nullable = true)
|-- value: struct (nullable = true)
| |-- _time: long (nullable = true)
| |-- agent: string (nullable = true)
| |-- bytes: string (nullable = true)
| |-- ip: string (nullable = true)
| |-- referrer: string (nullable = true)
| |-- remote_user: string (nullable = true)
| |-- request: string (nullable = true)
| |-- status: string (nullable = true)
| |-- time: string (nullable = true)
| |-- userid: long (nullable = true)
Create a view on top of it:
df2.createOrReplaceTempView("df2")
Now use spark-sql to extract the required elements for the output:
spark.sql("""
select rowtime , key, value.ip ip, value.userid userid, value.remote_user remote_user,
value.time time, value._time _time, value.request request, value.status status, value.bytes bytes,
value.referrer referrer, value.agent agent
from df2
""").show(false)
Output:
+-------------------------+---------------+---------------+------+-----------+----+-----+-----------------------------------+------+-----+--------+-------------------------------------------------------------------------------------------------------------------+
|rowtime |key |ip |userid|remote_user|time|_time|request |status|bytes|referrer|agent |
+-------------------------+---------------+---------------+------+-----------+----+-----+-----------------------------------+------+-----+--------+-------------------------------------------------------------------------------------------------------------------+
|2020/06/11 10:38:42.528 Z|111.245.174.248|111.245.174.248|30 |- |11 |11 |GET /site/login.html HTTP/1.1 |302 |14096|- |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36|
|2020/06/11 10:38:42.528 Z|111.245.174.248|111.245.174.248|30 |- |11 |11 |GET /site/login.html HTTP/1.1 |302 |14096|- |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36|
|2020/06/11 10:38:42.705 Z|122.152.45.245 |122.152.45.245 |11 |- |21 |21 |GET /images/logo-small.png HTTP/1.1|407 |4196 |- |Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36|
+-------------------------+---------------+---------------+------+-----------+----+-----+-----------------------------------+------+-----+--------+-------------------------------------------------------------------------------------------------------------------+