[英]Create spark data frame from custom data format
我有一个文本文件,其中String REC作为记录定界符,而换行符作为列定界符,每个数据都有附加的列名,以逗号作为定界符,以下是示例数据格式
记录
编号19048
期限,牛奶
等级1
记录
Id,19049
术语,玉米
等级5
使用REC作为记录定界符。现在,我想创建具有列名ID,术语和等级的Spark数据框架。请协助我。
这是工作代码
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}
object RecordSeparator extends App {
var conf = new
SparkConf().setAppName("test").setMaster("local[1]")
.setExecutorEnv("executor- cores", "2")
var sc = new SparkContext(conf)
val hconf = new Configuration
hconf.set("textinputformat.record.delimiter", "REC")
val data = sc.newAPIHadoopFile("data.txt",
classOf[TextInputFormat], classOf[LongWritable],
classOf[Text], hconf).map(x => x._2.toString.trim).filter(x => x != "")
.map(x => getRecord(x)).map(x => x.split(","))
.map(x => record(x(0), x(2), x(2)))
val sqlContext = new SQLContext(sc)
val df = data.toDF()
df.printSchema()
df.show(false)
def getRecord(in: String): String = {
val ar = in.split("\n").mkString(",").split(",")
val data = Array(ar(1), ar(3), ar(5))
data.mkString(",")
}
}
case class record(Id: String, Term: String, Rank: String)
输出:
root
|-- Id: string (nullable = true)
|-- Term: string (nullable = true)
|-- Rank: string (nullable = true)
+-----+----+----+
|Id |Term|Rank|
+-----+----+----+
|19048|1 |1 |
|19049|5 |5 |
+-----+----+----+
假设您的文件位于“普通”文件系统(而非HDFS)上,则必须编写一个文件解析器,然后使用sc.parallelize
创建一个RDD
,然后创建一个DataFrame
:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable
object Demo extends App {
val conf = new SparkConf().setMaster("local[1]").setAppName("Demo")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
case class Record(
var id:Option[Int] = None,
var term:Option[String] = None,
var rank:Option[Int] = None)
val filename = "data.dat"
val records = readFile(filename)
val df = sc.parallelize(records).toDF
df.printSchema()
df.show()
def readFile(filename:String) : Seq[Record] = {
import scala.io.Source
val records = mutable.ArrayBuffer.empty[Record]
var currentRecord: Record = null
for (line <- Source.fromFile(filename).getLines) {
val tokens = line.split(',')
currentRecord = tokens match {
case Array("REC") => Record()
case Array("Id", id) => {
currentRecord.id = Some(id.toInt); currentRecord
}
case Array("Term", term) => {
currentRecord.term = Some(term); currentRecord
}
case Array("Rank", rank) => {
currentRecord.rank = Some(rank.toInt); records += currentRecord;
null
}
}
}
return records
}
}
这给
root
|-- id: integer (nullable = true)
|-- term: string (nullable = true)
|-- rank: integer (nullable = true)
+-----+----+----+
| id|term|rank|
+-----+----+----+
|19048|milk| 1|
|19049|corn| 5|
+-----+----+----+
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.