Spark: read key-value pairs from a file into a DataFrame
I need to read a log file and convert it into a Spark DataFrame.
Input file content:
dateCreated : 20200720
customerId : 001
dateCreated : 20200720
customerId : 002
dateCreated : 20200721
customerId : 003
Expected DataFrame:
---------------------------
|dateCreated | customerId |
---------------------------
|20200720    | 001        |
|20200720    | 002        |
|20200721    | 003        |
---------------------------
Spark code:
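val spark = org.apache.spark.sql.SparkSession.builder.master("local").getOrCreate()
val inputFile = "C:\\log_data.txt"
val rddFromFile = spark.sparkContext.textFile(inputFile)
// Split each "key : value" line on the first colon and trim the padding around both parts
val rdd = rddFromFile.map(f => f.split(":", 2).map(_.trim))
// Print each key and its value, separated by a tab
rdd.foreach(f => println(f(0) + "\t" + f(1)))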
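Any idea how to convert the above RDD into the required DataFrame?
Check the code below. It splits each line into a key and a value, spreads them into dateCreated and customerId columns, and then uses lead over a row id to pull each customerId up onto the row of the preceding dateCreated.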
scala> import sys.process._
import sys.process._
scala> "cat /tmp/sample/input.csv".!
dateCreated : 20200720
customerId : 001
dateCreated : 20200720
customerId : 002
dateCreated : 20200721
customerId : 003
scala> val df = spark.read.text("/tmp/sample").select(split($"value",":").as("data"))
df: org.apache.spark.sql.DataFrame = [data: array<string>]
scala> df.show(false)
+------------------------+
|data                    |
+------------------------+
|[dateCreated , 20200720]|
|[customerId , 001]      |
|[dateCreated , 20200720]|
|[customerId , 002]      |
|[dateCreated , 20200721]|
|[customerId , 003]      |
+------------------------+
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> val windowSpec = Window.orderBy($"id".asc)
scala> df
         .select(trim($"data"(0)).as("data"), trim($"data"(1)).as("values"))
         .select(map($"data", $"values").as("data"))
         .select($"data"("dateCreated").as("dateCreated"), $"data"("customerId").as("customerId"))
         .withColumn("id", monotonically_increasing_id)
         .withColumn("customerId", lead($"customerId", 1).over(windowSpec))
         .where($"customerId".isNotNull)
         .drop("id")
         .show(false)
+-----------+----------+
|dateCreated|customerId|
+-----------+----------+
|20200720   |001       |
|20200720   |002       |
|20200721   |003       |
+-----------+----------+
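An alternative that does not need a window function could look like the sketch below. It assumes every record is exactly two consecutive lines (dateCreated, then customerId) and that values contain no further structure beyond "key : value". The object name KeyValueLog and the helpers parseLine and toRecord are made up for illustration; the path is the one used in the session above. The idea is to number the lines with zipWithIndex, group every two lines under the key idx / 2, and build one row per group.

```scala
import org.apache.spark.sql.SparkSession

object KeyValueLog {
  // Split "key : value" on the first colon only and trim both sides.
  // A line without a colon yields an empty value (assumption: such lines do not occur).
  def parseLine(line: String): (String, String) = {
    val parts = line.split(":", 2).map(_.trim)
    (parts(0), if (parts.length > 1) parts(1) else "")
  }

  // Turn the key/value pairs of one record into (dateCreated, customerId).
  // A missing key becomes null.
  def toRecord(pairs: Iterable[(String, String)]): (String, String) = {
    val m = pairs.toMap
    (m.getOrElse("dateCreated", null), m.getOrElse("customerId", null))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("kv-log").getOrCreate()
    import spark.implicits._

    val df = spark.sparkContext
      .textFile("/tmp/sample/input.csv")                      // path taken from the session above
      .zipWithIndex()                                         // (line, 0-based line number)
      .map { case (line, idx) => (idx / 2, parseLine(line)) } // two lines per record
      .groupByKey()                                           // collect both lines of a record
      .sortByKey()                                            // keep records in file order
      .values
      .map(toRecord)
      .toDF("dateCreated", "customerId")

    df.show(false)
    spark.stop()
  }
}
```

Grouping on the line index avoids the window without partitionBy, which moves all rows into a single partition; the trade-off is the shuffle introduced by groupByKey.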