
UPDATE table in SQL SERVER database with data in HIVE using Spark

I have my master table in SQL Server, and I want to update a few of its columns when 3 columns match between the master table (in the SQL Server DB) and the target table (in HIVE). Both tables have many columns, but I'm only interested in the 6 columns highlighted below:

The 3 columns that I want to update in my master table are

"INSPECTED_BY", "INSPECTION_COMMENTS" and "SIGNED_BY"

The columns that I want to use as my matching condition are

"SERVICE_NUMBER", "PART_ID" and "LOTID"

I tried the code below, but it's giving me a NullPointerException:

val df = spark.table("location_of_my_table_in_hive")
df.show(false)
df.foreachPartition(partition => 
{
    val Connection = DriverManager.getConnection(SQLjdbcURL, SQLusername, SQLPassword)
    val batch_size = 100
    var psmt: PreparedStatement = null 

    partition.grouped(batch_size).foreach(batch => 
    {
        batch.foreach{row => 
            {
                val inspctbyIndex = row.fieldIndex("INSPECTED_BY")
                val inspctby = row.getString(inspctbyIndex)
        
                val inspcomIndex = row.fieldIndex("INSPECT_COMMENTS")
                val inspcom = row.getString(inspcomIndex)
        
                val signIndex = row.fieldIndex("SIGNED_BY")
                val signby = row.getString(signIndex)
        
                val sqlquery = "MERGE INTO SERVICE_LOG_TABLE as LOG" +
                    "USING (VALUES(?, ?, ?))" +
                    "AS ROW(inspctby, inspcom, signby)" +
                    "ON LOG.SERVICE_NUMBER = ROW.SERVICE_NUMBER and LOG.PART_ID = ROW.PART_ID and LOG.LOTID = ROW.LOTID" +
                    "WHEN MATCHED THEN UPDATE SET INSPECTED_BY = 'SMITH', INSPECT_COMMENTS = 'STANDARD_MET', SIGNED_BY = 'WILL'" +
                    "WHEN NOT MATCHED THEN INSERT VALUES(ROW.INSPECTED_BY, ROW.INSPECT_COMMENTS, ROW.SIGNED_BY)"
                var psmt: PreparedStatement = Connection.prepareStatement(sqlquery)
        
                psmt.setString(1, inspctby)
                psmt.setString(2, inspcom)
                psmt.setString(3, signby)
                psmt.addBatch()
            }   
        }
        psmt.executeBatch()
        Connection.commit()
        psmt.close()
    })
    Connection.close()
})

Here is the error:

ERROR scheduler.TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 
times, most recent failure: Lost task 0.3 in stage 2.0 (TID 9, lwtxa0gzpappr.corp.bankofamerica.com, 
executor 4): java.lang.NullPointerException
    at $anonfun$1$$anonfun$apply$1.apply(/location/service_log.scala:101)
    at $anonfun$1$$anonfun$apply$1.apply(/location/service_log.scala:74)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at $anonfun$1.apply(/location/service_log.scala:74)
    at $anonfun$1.apply(/location/service_log.scala:68)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I searched the internet and could not find the reason why this error occurs. Any help would be appreciated.

If you are running this on a Spark cluster, I think you might have to broadcast some object. The executors are not able to get the value of that object, hence the NullPointerException.
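There is also a scope bug visible in the posted code that would produce exactly this NullPointerException without any serialization issue: the outer `var psmt: PreparedStatement = null` is never assigned. The `var psmt` declared inside `batch.foreach` is a *new local* that shadows it, so by the time `psmt.executeBatch()` runs after the row loop, it is invoked on the still-null outer variable. A minimal, self-contained sketch of the pattern (the object and string values here are illustrative, no real database connection is involved):

```scala
import java.sql.PreparedStatement

object ShadowDemo {
  def run(): String = {
    // Outer declaration, mirroring the question's code: assigned null, never updated.
    var psmt: PreparedStatement = null

    Seq("row1", "row2").foreach { _ =>
      // The question re-declares psmt inside the loop; this local
      // SHADOWS the outer variable, which therefore stays null.
      val psmt: String = "a new local, not the outer variable"
      require(psmt.nonEmpty)
    }

    // Outside the loop, psmt refers to the outer, still-null reference.
    try { psmt.executeBatch(); "ok" }
    catch { case _: NullPointerException => "NullPointerException" }
  }
}
```

The likely fix is to prepare the single outer statement once per partition (`psmt = connection.prepareStatement(sqlquery)` before the `grouped` loop, since the SQL text never changes), call `addBatch()` per row, and `executeBatch()` per group. Two further things worth checking in the posted code: the concatenated SQL fragments have no trailing spaces, so e.g. `...as LOG` and `USING...` run together into invalid SQL; and the MERGE binds only 3 placeholders while its `ON` clause references `ROW.SERVICE_NUMBER`, `ROW.PART_ID`, and `ROW.LOTID`, which are not part of the `VALUES(?, ?, ?)` row alias.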
