Getting data from an avro file for every row in a spark dataframe column
Convert dataset to dataframe from an avro file
I wrote a Scala script to load Avro files and work with the resulting data (to retrieve the main contributors). The problem is that loading the files yields a Dataset that I cannot convert to a DataFrame, because it contains some complex types:
val history_src = "path_to_avro_files\\frwiki*.avro"
val revisions_dataset = spark.read.format("avro").load(history_src)
// gives a Dataset whose contents we can view; take(1) works without problems
val first_essay = revisions_dataset.map(row => (row.getString(0), row.getLong(2), row.get(3).asInstanceOf[mutable.WrappedArray[Revision]].array
.map(x=> (x.r_contributor.r_username, x.r_contributor.r_contributor_id, x.r_contributor.r_contributor_ip)))).take(1)
// fails: GenericRowWithSchema cannot be cast to Revision
val second_essay = revisions_dataset.map(row => (row.getString(0), row.getLong(2), row.get(3).asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]].toStream
.map(x=> (x.getLong(0),row.get(3).asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]].map(c => (c.getLong(0))))))).take(1)
// fails: WrappedArray$ofRef cannot be cast to scala.collection.mutable.ArrayBuffer
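Both casts fail for the same underlying reason: Spark materializes the elements of an array<struct> column as generic Row objects (GenericRowWithSchema), not as user-defined case classes, and the array itself is not an ArrayBuffer. A sketch of row-based access that avoids the casts entirely, using Spark's Row API (the column and field indices below are assumptions based on the schema printed further down: p_revisions at column index 3, r_contributor at field index 3 inside each revision struct):

```scala
import org.apache.spark.sql.Row

// Read struct elements through the generic Row API instead of casting.
// Note: the schema marks these fields nullable, so a robust version
// would guard getLong with isNullAt checks.
val contributors = revisions_dataset.map { row =>
  val revisions = row.getSeq[Row](3)            // p_revisions: array<struct>
  val perRevision = revisions.map { rev =>
    val c = rev.getStruct(3)                    // r_contributor: struct
    (c.getString(0), c.getLong(1), c.getString(2))
  }
  (row.getString(0), row.getLong(2), perRevision)
}
```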
I also tried using encoders with the case classes below, but it didn't work:
case class History (title: String, namespace: Long, id: Long, revisions: Array[Revision])
case class Contributor (r_username: String, r_contributor_id: Long, r_contributor_ip: String)
case class Revision(r_id: Long, r_parent_id: Long, timestamp : Long, r_contributor: Contributor, sha: String)
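A likely reason the encoder attempt fails is that Spark matches case class fields to columns by name, and these names differ from the schema below (title vs p_title, timestamp vs r_timestamp, sha vs r_sha1). A hedged sketch with field names mirroring the Avro schema exactly (Page replaces History, and Revision/Contributor are redefined here with the schema's names; this assumes all Long fields are non-null):

```scala
// Assumption: encoders resolve fields by column name, so the case classes
// must mirror the printed schema exactly.
case class Contributor(r_username: String, r_contributor_id: Long, r_contributor_ip: String)
case class Revision(r_id: Long, r_parent_id: Long, r_timestamp: Long,
                    r_contributor: Contributor, r_sha1: String)
case class Page(p_title: String, p_namespace: Long, p_id: Long, p_revisions: Array[Revision])

import spark.implicits._
val typed = spark.read.format("avro").load(history_src).as[Page]
```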
I can print the schema from my revisions_dataset, and it gives this:
root
|-- p_title: string (nullable = true)
|-- p_namespace: long (nullable = true)
|-- p_id: long (nullable = true)
|-- p_revisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- r_id: long (nullable = true)
| | |-- r_parent_id: long (nullable = true)
| | |-- r_timestamp: long (nullable = true)
| | |-- r_contributor: struct (nullable = true)
| | | |-- r_username: string (nullable = true)
| | | |-- r_contributor_id: long (nullable = true)
| | | |-- r_contributor_ip: string (nullable = true)
| | |-- r_sha1: string (nullable = true)
My goal is to have a DataFrame from which I can retrieve the list of contributors in the revisions list and flatten it, so that each page has its contributors at the same level as the title.
Any help?
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for $-syntax and toDF outside the shell
val r1 = Revision(1, 1, 1, Contributor("c1", 1, "ip1"), "sha")
val r2 = Revision(1, 1, 1, Contributor("c2", 2, "ip2"), "sha")
val r3 = Revision(1, 1, 1, Contributor("c3", 3, "ip3"), "sha")
val revisions_dataset = Seq(
("title1", 0L, 1L, Array(r1, r2)),
("title1", 0L, 2L, Array(r1, r3)),
("title1", 0L, 3L, Array(r2))
).toDF("p_title", "p_namespace", "p_id", "p_revisions")
val flattened = revisions_dataset.select($"p_title", $"p_id", explode($"p_revisions").alias("p_revision"))
.withColumn("r_contributor_username", $"p_revision.r_contributor.r_username")
.withColumn("r_contributor_id", $"p_revision.r_contributor.r_contributor_id")
.withColumn("r_contributor_ip", $"p_revision.r_contributor.r_contributor_ip")
.drop("p_revision")
flattened.show(false)
Output:
+-------+----+----------------------+----------------+----------------+
|p_title|p_id|r_contributor_username|r_contributor_id|r_contributor_ip|
+-------+----+----------------------+----------------+----------------+
|title1 |1 |c1 |1 |ip1 |
|title1 |1 |c2 |2 |ip2 |
|title1 |2 |c1 |1 |ip1 |
|title1 |2 |c3 |3 |ip3 |
|title1 |3 |c2 |2 |ip2 |
+-------+----+----------------------+----------------+----------------+
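The same flattening can also be sketched with the typed Dataset API, using flatMap instead of explode (this reuses the Revision and Contributor case classes from the answer's setup above; a sketch, not a drop-in replacement):

```scala
import spark.implicits._

// Flatten each page's revisions into one row per contributor.
val flattenedTyped = revisions_dataset
  .as[(String, Long, Long, Array[Revision])]
  .flatMap { case (title, _, id, revs) =>
    revs.map(r => (title, id,
      r.r_contributor.r_username,
      r.r_contributor.r_contributor_id,
      r.r_contributor.r_contributor_ip))
  }
  .toDF("p_title", "p_id", "r_contributor_username",
        "r_contributor_id", "r_contributor_ip")
```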