Turn textfile into dataframe with Scala Spark
I have a text file from S3 (actually multiple .gz files), and I wrote the code below:
val text = sc.textFile(path)
val df_text = text.map(row => row.split(",")).toDF()
But the result looks like this:
+--------------------+
| value|
+--------------------+
|[id, member_id, l...|
|[1077501, 1296599...|
|[1077430, 1314167...|
|[1077175, 1313524...|
|[1076863, 1277178...|
|[1075358, 1311748...|
|[1075269, 1311441...|
+--------------------+
I can't read it like this:
val df = spark.read.format("csv").option("header", "true").load(path)
because when I read it that way it can't find the header:
+-----------+-----------+-----------+
|1077430 |1356730 |4525526 |...
+-----------+-----------+-----------+
| 41173430| 1356730| 1456430|...
| 10237430| 1356660| 1463750|...
+-----------+-----------+-----------+
How can I make it a proper DataFrame?
In Spark 2.4.0 with Scala 2.12.8.
It's very easy:
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession
  .builder
  .master("local[*]")
  .getOrCreate
val sc = spark.sparkContext
val myGZs = sc
  .textFile("s3://route//*.gz")
  .map(parseToObject)
  .filter(obj => obj != null) // drop the header rows, which parseToObject maps to null
val myGZsDF = spark.createDataFrame(myGZs)
myGZsDF.printSchema()
where parseToObject is a function like:
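val parseToObject = (row: String) => {
  if (row is header) { // placeholder: program this check yourself
    null
  } else {
    // without the else branch, the null above would be discarded
    val split_row = row.split(",")
    Subscription(split_row(0).toLong, split_row(1).toLong ...)
  }
}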
An example of the Subscription case class:
case class Subscription(id: Long, memberId: Long ...)
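Returning null from the parser and filtering it out afterwards works, but an Option-based parser is a common alternative. The sketch below is only an illustration, not the answer's original code: the two-field Subscription, the names SubscriptionParser, parseLine and ParserDemo, and the rule that any line whose first two fields are not numeric (such as the header) is skipped are all assumptions. It runs on a plain Seq so it can be tried without a cluster.

```scala
import scala.util.Try

// Hypothetical two-field version of the case class; the real one has more columns.
final case class Subscription(id: Long, memberId: Long)

object SubscriptionParser {
  // Returns None for the header line and for rows that fail to parse,
  // so no null ever enters the data.
  def parseLine(row: String): Option[Subscription] = {
    val fields = row.split(",", -1).map(_.trim) // -1 keeps trailing empty fields
    if (fields.length < 2) None
    else Try(Subscription(fields(0).toLong, fields(1).toLong)).toOption
  }
}

object ParserDemo {
  // Sample lines modelled on the output shown in the question.
  val sample: Seq[String] = Seq("id,member_id", "1077501,1296599", "1077430,1314167")

  // flatMap drops the None results and unwraps the Some results in one step.
  val parsed: Seq[Subscription] =
    sample.flatMap(line => SubscriptionParser.parseLine(line))
}
```

In Spark, the same function could be passed to flatMap on the RDD returned by sc.textFile, which would replace the map plus filter pair with a single step.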
Both map and filter are narrow transformations, so neither one triggers a shuffle!
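Put together, the whole pipeline might look like the sketch below. It is an illustration under assumptions rather than the answer's exact code: the object name GzToDataFrame, the load helper, the two-column Subscription, the path taken from the command line, and the choice to treat the first line as the header (every file is assumed to start with that same header line) are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Try

object GzToDataFrame {
  // Hypothetical two-column version of the case class; add the remaining fields as needed.
  final case class Subscription(id: Long, memberId: Long)

  def load(spark: SparkSession, path: String): DataFrame = {
    import spark.implicits._ // brings toDF() into scope for RDDs of case classes

    // textFile accepts globs; .gz files are decompressed based on their extension.
    val lines = spark.sparkContext.textFile(path)
    val header = lines.first() // assumption: the input is non-empty and starts with the header

    lines
      .filter(_ != header) // narrow: drops the header line wherever it occurs
      .flatMap { row =>    // narrow: rows that fail to parse are dropped
        val f = row.split(",", -1)
        if (f.length < 2) None
        else Try(Subscription(f(0).trim.toLong, f(1).trim.toLong)).toOption
      }
      .toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("gz-to-df")
      .getOrCreate()
    val df = load(spark, args(0)) // args(0): a path or glob supplied by the caller
    df.printSchema()
    df.show(5)
    spark.stop()
  }
}
```

The resulting columns take their names from the case class fields (id and memberId here), so the header text itself is not needed once the header line has been filtered out.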
EDIT:
I have also seen this link from @kev on how to read multiple GZ files and convert them to a DF. Beware of the extension: it MUST be .gz.
Hope this helps. Let me know if you have any problem.
Tomás.