
Spark can't join DataFrames using left outer join

I am trying to join two DataFrames, matching df1's portfolio name to df2.portId. In the resulting DataFrame I do not want the same key repeated.

Here is my code so far:

val df = spark.read.json("C:\\json\\portmast") 
val pgetsec = spark.read.json("C:\\json\\pgetsec")


val portfolio_master = df.select("PortfolioCode","Legal Entity Name","Asofdate")
val pgetsecs= pgetsec.select("TransId", "SecId","portId","GaapCurBkBal","ParBal","SetlDt","SetlPric","OrgBkBal","TradeDt","StatCurBkBal","NaicRtg","SecurityTypeCode","CamraSecType","FundType","CountryIso")
val pg = portfolio_master.join(pgetsec,Seq("PortfolioCode","portId"),"left_outer")

The error I am getting is:
Exception in thread "main" org.apache.spark.sql.AnalysisException: using columns ['PortfolioCode,'portId] can not be resolved given input columns:

The final JSON should look like this:

|-- Portfolio Code: string (nullable = true)
|-- Legal Entity Name: string (nullable = true)
|-- Asofdate: string (nullable = true)

((SI, S&P 500 Index,9/30/2016),[0.0,Equity,Common Stock])
((SI, S&P 500 Index,9/30/2016),[0.0,Equity,Common Stock])
((SI, S&P 500 Index,9/30/2016),[0.0,Equity,Common Stock])
[SI1, S&P 500 Index,9/30/2016,CompactBuffer([0.0,Equity,Common Stock], [0.0,Equity,Common Stock], [0.0,Equity,Common Stock])]
root
|-- Portfolio Code: string (nullable = true)
|-- Legal Entity Name: string (nullable = true)
|-- Asofdate: string (nullable = true)
|-- Security: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- BondPrice: double (nullable = true)
|    |    |-- CoreSectorLevel1Code: string (nullable = true)
|    |    |-- CoreSectorLevel2Code: string (nullable = true)

+--------------+-------------------+---------+--------------------+
|Portfolio Code|  Legal Entity Name| Asofdate|            Security|
+--------------+-------------------+---------+--------------------+
|           SI | S&P 500 Index     |9/30/2016|[[0.0,Equity,Comm...|
+--------------+-------------------+---------+--------------------+

Any help is appreciated.

portId doesn't exist in portfolio_master, and PortfolioCode doesn't exist in pgetsec. A join on Seq("PortfolioCode","portId") needs every listed column to be present in both DataFrames. If you reread the full error message, you'll see that it explains this, since it also lists the available columns.

What you want is portfolio_master("PortfolioCode") === pgetsec("portId") as your join condition.
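
For illustration, here is a minimal sketch of that join condition. It uses small in-memory DataFrames in place of the JSON files from the question, so the sample rows, the object name JoinSketch, and the local SparkSession settings are assumptions made only for the example. Since the question also asks that the key not be repeated, the sketch drops the right-hand portId column after the join; that step is one possible approach, not something the error itself requires.

```scala
import org.apache.spark.sql.SparkSession

object JoinSketch {
  def main(args: Array[String]): Unit = {
    // Local session for the example only; a real job would reuse its existing session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the question's DataFrames (values are made up).
    val portfolio_master = Seq(
      ("SI", "S&P 500 Index", "9/30/2016"),
      ("XX", "Portfolio Without Securities", "9/30/2016")
    ).toDF("PortfolioCode", "Legal Entity Name", "Asofdate")

    val pgetsec = Seq(
      ("T1", "SI", "Equity"),
      ("T2", "SI", "Equity")
    ).toDF("TransId", "portId", "SecurityTypeCode")

    // The key has a different name on each side, so join on an explicit
    // column expression instead of a Seq of shared column names.
    val joined = portfolio_master.join(
      pgetsec,
      portfolio_master("PortfolioCode") === pgetsec("portId"),
      "left_outer"
    )

    // Drop the right-hand key so the same value does not appear twice.
    val pg = joined.drop(pgetsec("portId"))

    pg.printSchema()
    pg.show(false)

    spark.stop()
  }
}
```

Because this is a left outer join, portfolios with no matching rows in pgetsec are kept, with nulls in the pgetsec columns.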
