How to fix org.apache.spark.sql.AnalysisException when changing the column order of a DataFrame?

Question

I am trying to load data from an RDBMS table in Postgres into a Hive table on HDFS.

      val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
                        .option("dbtable", s"(${query}) as year2017")
                        .option("user", devUserName).option("password", devPassword)
                        .option("numPartitions",15).load()

The Hive table is dynamically partitioned on two columns: source_system_name and period_year. These column names are stored in a metadata table, metatables:

val spColsDF = spark.read.format("jdbc").option("url",hiveMetaConURL)
                    .option("dbtable", "(select partition_columns from metainfo.metatables where tablename='finance.xx_gl_forecast') as colsPrecision")
                    .option("user", metaUserName)
                    .option("password", metaPassword)
                    .load()

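I am trying to move the partition columns, source_system_name and period_year, to the end of the DataFrame yearDF, because the columns used for Hive dynamic partitioning must come last. To do that, I came up with the following logic: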

val partition_columns      = spColsDF.select("partition_columns").collect().map(_.getString(0)).toSeq
val allColsOrdered         = yearDF.columns.diff(partition_columns) ++ partition_columns
val allCols                = allColsOrdered.map(coln => org.apache.spark.sql.functions.col(coln))
val resultDF               = yearDF.select(allCols:_*)

When I run the code, I get an org.apache.spark.sql.AnalysisException:

Exception in thread "main" 18/08/28 18:09:30 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
org.apache.spark.sql.AnalysisException: cannot resolve '`source_system_name,period_year`' given input columns: [cost_center, period_num, period_name, currencies, cc_channel, scenario, xx_pk_id, period_year, cc_region, reference_code, source_system_name, source_record_type, xx_last_update_tms, xx_last_update_log_id, book_type, cc_function, product_line, ptd_balance_text, project, ledger_id, currency_code, xx_data_hash_id, qtd_balance_text, pl_market, version, qtd_balance, period, ptd_balance, ytd_balance_text, xx_hvr_last_upd_tms, geography, year, del_flag, trading_partner, ytd_balance, xx_data_hash_code, xx_creation_tms, forecast_id, drm_org, account, business_unit, gl_source_name, gl_source_system_name];;
'Project [forecast_id#26L, period_year#27, period_num#28, period_name#29, drm_org#30, ledger_id#31L, currency_code#32, source_system_name#33, source_record_type#34, gl_source_name#35, gl_source_system_name#36, year#37, period#38, scenario#39, version#40, currencies#41, business_unit#42, account#43, trading_partner#44, cost_center#45, geography#46, project#47, reference_code#48, product_line#49, ... 20 more fields]
+- Relation[forecast_id#26L,period_year#27,period_num#28,period_name#29,drm_org#30,ledger_id#31L,currency_code#32,source_system_name#33,source_record_type#34,gl_source_name#35,gl_source_system_name#36,year#37,period#38,scenario#39,version#40,currencies#41,business_unit#42,account#43,trading_partner#44,cost_center#45,geography#46,project#47,reference_code#48,product_line#49,... 19 more fields] JDBCRelation((select forecast_id,period_year,period_num,period_name,drm_org,ledger_id,currency_code,source_system_name,source_record_type,gl_source_name,gl_source_system_name,year,period,scenario,version,currencies,business_unit,account,trading_partner,cost_center,geography,project,reference_code,product_line,book_type,cc_region,cc_channel,cc_function,pl_market,ptd_balance,qtd_balance,ytd_balance,xx_hvr_last_upd_tms,xx_creation_tms,xx_last_update_tms,xx_last_update_log_id,xx_data_hash_code,xx_data_hash_id,xx_pk_id,null::integer as del_flag,ptd_balance::character varying as ptd_balance_text,qtd_balance::character varying as qtd_balance_text,ytd_balance::character varying as ytd_balance_text from analytics.xx_gl_forecast where period_year='2017') as year2017) [numPartitions=1]

However, if I pass the same column names in a different way, as below, the code works fine:

val lastCols        = Seq("source_system_name","period_year")
val allColsOrdered  = yearDF.columns.diff(lastCols) ++ lastCols
val allCols         = allColsOrdered.map(coln => org.apache.spark.sql.functions.col(coln))
val resultDF        = yearDF.select(allCols:_*)

Can anyone tell me what I am doing wrong here?

Answer 1

If you look at the error:

 cannot resolve '`source_system_name,period_year`'

it means that the following line:

spColsDF.select("partition_columns").collect().map(_.getString(0)).toSeq

returns something like this:

Array("source_system_name,period_year")

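In other words, the two column names are concatenated into a single string that forms the first element of the array, instead of being the separate elements you want. Spark then tries to resolve that whole string as one column name, which does not exist in yearDF.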
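The effect can be reproduced without Spark. The following is a minimal plain-Scala sketch: the object name ConcatenatedNameDemo and the shortened column list are hypothetical stand-ins for yearDF.columns and the collected metadata value, not the asker's real data.

```scala
// Plain-Scala sketch (no Spark required).
// `columns` is a shortened, hypothetical stand-in for yearDF.columns.
object ConcatenatedNameDemo {
  val columns: Array[String] =
    Array("forecast_id", "period_year", "source_system_name", "account")

  // Assumed shape of the collected metadata: one comma-separated string.
  val partitionColumns: Seq[String] = Seq("source_system_name,period_year")

  // diff removes nothing, because no column equals the combined string;
  // the combined string is then appended as if it were a column name.
  val allColsOrdered: Array[String] =
    columns.diff(partitionColumns) ++ partitionColumns

  def main(args: Array[String]): Unit =
    allColsOrdered.foreach(println)
}
```

The last entry printed is the combined string, which is the name Spark later reports it cannot resolve.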

To get the desired result, you need to split it on the comma (,). For example, the following should work:

spColsDF.select("partition_columns").collect.flatMap(_.getAs[String](0).split(","))
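As a rough illustration of the split-then-reorder step, here is a plain-Scala sketch that runs without Spark. The helper names splitColumnNames and moveToEnd are hypothetical, and the trimming of whitespace around each name is an added assumption (the metadata value might contain spaces after the comma); the input values are shortened stand-ins.

```scala
// Plain-Scala sketch (no Spark required); helper names are hypothetical.
object SplitPartitionColumns {
  // Split each collected value on commas, trim whitespace (an assumption
  // about the stored format), and drop any empty pieces.
  def splitColumnNames(raw: Seq[String]): Seq[String] =
    raw.flatMap(_.split(",").map(_.trim)).filter(_.nonEmpty)

  // Move the given columns to the end, keeping the remaining columns
  // in their original order.
  def moveToEnd(columns: Seq[String], last: Seq[String]): Seq[String] =
    columns.diff(last) ++ last

  def main(args: Array[String]): Unit = {
    // Shortened stand-in for yearDF.columns.
    val columns = Seq("forecast_id", "period_year", "source_system_name", "account")
    val partitionColumns = splitColumnNames(Seq("source_system_name, period_year"))
    println(moveToEnd(columns, partitionColumns).mkString(", "))
  }
}
```

With a real DataFrame, the reordered names would then be mapped to columns and passed to select, as in the question's own code.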

Answer 1 (accepted), 2018-08-29 08:14:10