Turn textfile into dataframe with Scala Spark

I have a text file from S3 (actually multiple .gz files), and I wrote the code below:

val text = sc.textFile(path)
val df_text = text.map(row => row.split(",")).toDF()

But the result looks like this:

+--------------------+
|               value|
+--------------------+
|[id, member_id, l...|
|[1077501, 1296599...|
|[1077430, 1314167...|
|[1077175, 1313524...|
|[1076863, 1277178...|
|[1075358, 1311748...|
|[1075269, 1311441...|
+--------------------+

I can't read it with val df = spark.read.format("csv").option("header", "true").load(path), because when I read it that way it can't find the header:

+-----------+-----------+-----------+
|1077430    |1356730    |4525526    |...
+-----------+-----------+-----------+
|   41173430|    1356730|    1456430|...
|   10237430|    1356660|    1463750|...
+-----------+-----------+-----------+

How can I make it a proper DataFrame?

In Spark 2.4.0 with Scala 2.12.8.

It's very easy:

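import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession
      .builder
      .master("local[*]")
      .getOrCreate()
val sc = spark.sparkContext

// Read every .gz file under the path, parse each line, and drop the lines
// that parseToObject rejected (it returns null for the header).
val myGZs = sc
      .textFile("s3://route//*.gz")
      .map(parseToObject)
      .filter(obj => obj != null)

// The schema is derived from the fields of the Subscription case class.
val myGZsDF = spark.createDataFrame(myGZs)
myGZsDF.printSchema()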

where parseToObject is a function like:

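val parseToObject = (row: String) => {
   // Header check: assumes the header line starts with the "id" column,
   // as in the output shown in the question. Adapt it to your files.
   if (row.startsWith("id,")) {
      null
   } else {
      val split_row = row.split(",")
      // Scala indexes arrays with parentheses, not square brackets.
      Subscription(split_row(0).toLong, split_row(1).toLong ...)
   }
}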

An example of the Subscription case class:

case class Subscription(id: Long, memberId: Long ...)
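As a possible variation on the null-returning parser, here is a minimal sketch that returns an Option instead, so header and malformed lines can be dropped in one step. The two-field Subscription and the SubscriptionParser object are hypothetical stand-ins for illustration; the real case class has more columns than the two shown here.

```scala
import scala.util.Try

// Hypothetical two-field version of Subscription, for illustration only.
final case class Subscription(id: Long, memberId: Long)

object SubscriptionParser {
  // Returns None for any line whose first two fields are not numeric,
  // which covers the header line ("id,member_id,...") and malformed rows.
  def parse(row: String): Option[Subscription] = {
    // limit -1 keeps trailing empty fields instead of silently dropping them
    val fields = row.split(",", -1).map(_.trim)
    if (fields.length < 2) None
    else Try(Subscription(fields(0).toLong, fields(1).toLong)).toOption
  }
}
```

With this shape, the map and filter pair could be written as sc.textFile(path).flatMap(row => SubscriptionParser.parse(row)), and no null ever reaches createDataFrame.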

Both map and filter are narrow transformations, so neither of them triggers a shuffle!

EDIT:

I have also seen this link from @kev on how to read multiple GZ files and convert them to a DF. Beware of the extension: it MUST be .gz, because Spark picks the decompression codec from the file extension.
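For reference, reading several .gz files straight into a DataFrame might look like the sketch below. It assumes the files really end in .gz and that each one starts with a header line. The ReadGzCsv object, the readSubscriptions helper, and the two-column schema are hypothetical; the column names id and member_id are taken from the output in the question, and a real schema would list every column in file order.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ReadGzCsv {
  // Hypothetical schema with only the first two columns from the question.
  val subscriptionSchema: StructType = StructType(Seq(
    StructField("id", LongType, nullable = true),
    StructField("member_id", LongType, nullable = true)
  ))

  // path may be a glob such as ".../*.gz"; files with a .gz extension are
  // decompressed transparently by the CSV reader.
  def readSubscriptions(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true") // skip the header line of each file
      .schema(subscriptionSchema) // explicit types, no schema inference pass
      .csv(path)
}
```

Passing an explicit schema means the column names and types no longer depend on what the reader finds in the first line, which may help when header detection misbehaves as described in the question.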

Hope this helps. Let me know if you have any problems. Tomás.
