Batching of Dataset Spark scala
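I am trying to create batches of the rows of a Dataset in Spark. To control the number of records sent to a service, I want to batch the items so that I can maintain the rate at which data is sent. For example, with these case classes: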
case class Person(name:String, address: String)
case class PersonBatch(personBatch: List[Person])
For a given Dataset[Person], I want to create a Dataset[PersonBatch].
For example, if the input Dataset[Person] has 100 records, the output should be a Dataset[PersonBatch] in which every PersonBatch is a list of n records (Person).
I tried this, but it does not work.
object DataBatcher extends Logger {
  var batchList: ListBuffer[PersonBatch] = ListBuffer[PersonBatch]()
  var batchSize: Long = 500 //default batch size

  def addToBatchList(batch: PersonBatch): Unit = {
    batchList += batch
  }

  def clearBatchList(): Unit = {
    batchList.clear()
  }

  def createBatches(ds: Dataset[Person]): Dataset[PersonBatch] = {
    val dsCount = ds.count()
    logger.info(s"Count of dataset passed for creating batches : ${dsCount}")
    val batchElement = ListBuffer[Person]()
    val batch = PersonBatch(batchElement)
    ds.foreach(x => {
      batch.personBatch += x
      if(batch.personBatch.length == batchSize) {
        addToBatchList(batch)
        batch.requestBatch.clear()
      }
    })
    if(batch.personBatch.length > 0) {
      addToBatchList(batch)
      batch.personBatch.clear()
    }
    sparkSession.createDataset(batchList)
  }
}
I want to run this job on a Hadoop cluster. Can someone help me with this?
The grouped function of rdd.iterator may be useful to you.
For example:
iter.grouped(batchSize)
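To connect grouped back to the question, here is a minimal sketch in plain Scala of how one partition's iterator could be turned into PersonBatch values. The names BatchSketch and batchPartition are hypothetical, made up for this illustration. The closure passed to ds.foreach runs on the executors, so appending to a driver-side ListBuffer from inside it does not collect anything on the driver; batching inside each partition avoids that problem.

```scala
// Same shapes as the case classes in the question.
case class Person(name: String, address: String)
case class PersonBatch(personBatch: List[Person])

// BatchSketch and batchPartition are hypothetical names used only for this sketch.
object BatchSketch {
  // Turns one partition's iterator into batches of at most batchSize records.
  // Iterator.grouped is lazy, so batches are built one at a time as they are consumed.
  def batchPartition(people: Iterator[Person], batchSize: Int): Iterator[PersonBatch] =
    people.grouped(batchSize).map(group => PersonBatch(group.toList))

  // Assumption: with Spark on the classpath and spark.implicits._ in scope (not shown here),
  // the same function could be applied per partition through Dataset.mapPartitions:
  //   val batches: Dataset[PersonBatch] = ds.mapPartitions(it => batchPartition(it, 10))

  def main(args: Array[String]): Unit = {
    val people = (1 to 25).map(i => Person(s"name$i", s"address$i"))
    val batches = batchPartition(people.iterator, 10).toList
    println(batches.map(_.personBatch.size)) // List(10, 10, 5)
  }
}
```

Because the grouping happens per partition, the last batch of every partition may hold fewer than batchSize records, not only the last batch of the whole Dataset.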
Here is a sample code snippet that does a batch insert into a database using iter.grouped(batchsize), with a batch size of 1000.
import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.Row

// numPartitions ~ number of simultaneous DB connections you plan to allow.
// Assumption: the dataFrame used below is this repartitioned result.
val dataFrame = df.repartition(numofpartitionsyouwant)

def insertToTable(sqlDatabaseConnectionString: String,
                  sqlTableName: String): Unit = {
  val tableHeader: String = dataFrame.columns.mkString(",")
  // The explicit parameter type avoids the overload ambiguity of foreachPartition on Scala 2.12.
  dataFrame.foreachPartition { (partition: Iterator[Row]) =>
    // NOTE: one connection per partition (a connection pool would be better)
    val sqlExecutorConnection: Connection =
      DriverManager.getConnection(sqlDatabaseConnectionString)
    try {
      // A batch size of 1000 is used because some databases, e.g. Azure SQL, cannot take more than 1000 rows per INSERT
      partition.grouped(1000).foreach { group =>
        val insertString = new scala.collection.mutable.StringBuilder()
        group.foreach { record =>
          // Quote each column value separately: ('v1','v2'),
          insertString.append("(" + record.mkString("'", "','", "'") + "),")
        }
        val statement = sqlExecutorConnection.createStatement()
        try {
          statement.executeUpdate(s"INSERT INTO [$sqlTableName] ($tableHeader) VALUES "
            + insertString.toString.stripSuffix(","))
        } finally {
          statement.close()
        }
      }
    } finally {
      sqlExecutorConnection.close() // always close the connection so that connections are not exhausted
    }
  }
}
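The statement text that such a snippet sends for one group can be easier to reason about as a pure function. The following is only a sketch under stated assumptions: InsertSketch, quote and buildInsert are hypothetical helpers, every column value is treated as a string, and embedded single quotes are escaped by doubling them. A PreparedStatement with addBatch is generally the safer choice when the driver supports it.

```scala
// InsertSketch, quote and buildInsert are hypothetical names used only for this sketch.
object InsertSketch {
  // Quotes one value as a SQL string literal, doubling embedded single quotes.
  def quote(value: String): String = "'" + value.replace("'", "''") + "'"

  // Builds one multi-row INSERT for a batch; each row is a sequence of column values.
  // Assumption: all values are strings and the table name comes from trusted code.
  def buildInsert(table: String, columns: Seq[String], rows: Seq[Seq[String]]): String = {
    val header = columns.mkString(",")
    val values = rows.map(row => row.map(quote).mkString("(", ",", ")")).mkString(",")
    s"INSERT INTO [$table] ($header) VALUES $values"
  }
}
```

Inside the grouped loop, each group of rows would be mapped to such a sequence of values and passed to buildInsert, producing one statement per batch of at most 1000 rows.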
import java.sql.{DriverManager, SQLException, SQLIntegrityConstraintViolationException, Timestamp}
import org.apache.spark.sql.Row

dataFrame.foreachPartition((it: Iterator[Row]) => {
  val url = "jdbc:..." + "user=;password=;"
  val conn = DriverManager.getConnection(url)
  conn.setAutoCommit(false) // commits are issued explicitly after each batch
  val stmt = conn.createStatement()
  val batchSize = 10
  var i = 0
  try {
    while (it.hasNext) {
      val row = it.next()
      try {
        stmt.addBatch(" UPDATE TABLE SET STATUS = 0 , " +
          " DATE ='" + new Timestamp(System.currentTimeMillis()) + "'" +
          " where id = " + row.getAs[Any]("IDNUM"))
        i += 1
        // Send the accumulated statements once every batchSize rows
        if (i % batchSize == 0) {
          stmt.executeBatch()
          conn.commit()
        }
      } catch {
        case _: SQLIntegrityConstraintViolationException => // constraint violations are deliberately ignored
        case e: SQLException =>
          e.printStackTrace()
      }
    }
    // Flush whatever is left after the last full batch
    val ret = stmt.executeBatch()
    println("Ret val: " + java.util.Arrays.toString(ret))
    println("Update count: " + stmt.getUpdateCount)
    conn.commit()
  } finally {
    stmt.close()
    conn.close()
  }
})