
How to correctly join two dataframes in Spark

Given these datasets:

productsMetadataDF

{'asin': '0006428320', 'title': 'Six Sonatas For Two Flutes Or Violins, Volume 2 (#4-6)', 'price': 17.95, 'imUrl': 'http://ecx.images-amazon.com/images/I/41EpRmh8MEL._SY300_.jpg', 'salesRank': {'Musical Instruments': 207315}, 'categories': [['Musical Instruments', 'Instrument Accessories', 'General Accessories', 'Sheet Music Folders']]}

productsRatingsDF

{"reviewerID": "AORCXT2CLTQFR", "asin": "0006428320", "reviewerName": "Justo Roteta", "helpful": [0, 0], "overall": 4.0, "summary": "Not a classic but still a good album from Yellowman.", "unixReviewTime": 1383436800, "reviewTime": "11 3, 2013"}

and this function:

def findProductFeatures(productsRatingsDF : DataFrame, productsMetadataDF : DataFrame) : DataFrame = {
    productsRatingsDF
      .withColumn("averageRating", avg("overall"))
      .join(productsMetadataDF,"asin")
      .select($"asin", $"categories", $"price", $"averageRating")
  }

Would this be the correct way to join these two datasets on asin?

Here's the error I get:

Exception in thread "main" org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and '`asin`' is not an aggregate function. Wrap '(avg(`overall`) AS `averageRating`)' in windowing function(s) or wrap '`asin`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [asin#6, helpful#7, overall#8, reviewText#9, reviewTime#10, reviewerID#11, reviewerName#12, summary#13, unixReviewTime#14L, avg(overall#8) AS averageRating#99]
+- Relation[asin#6,helpful#7,overall#8,reviewText#9,reviewTime#10,reviewerID#11,reviewerName#12,summary#13,unixReviewTime#14L] json

Do I understand the error correctly: is there a problem with the way I join? I tried changing the order of the .withColumn and .join, but it didn't work. There also seems to be an error when I try to put avg("overall") into a column, computed per asin.

The end result should be a dataframe with 4 columns: "asin", "categories", "price", and "averageRating".

The problem seems to be:

.withColumn("averageRating", avg("overall"))

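avg is an aggregate function, and withColumn does not group anything, so Spark treats the whole expression as an aggregation with no grouping columns and rejects every non-aggregated column such as asin. Do a proper aggregation before joining:

df
  .groupBy("asin") // your columns
  .agg(avg("overall").as("averageRating"))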
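Putting the aggregation and the join together, the function from the question could look roughly like the sketch below. This is one possible shape, not the only one. The object name `ProductFeatures`, the `main` method, the local SparkSession and the two tiny in-memory dataframes are assumptions added purely for illustration. The column names come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col}

// Hypothetical wrapper object, only here so the sketch compiles on its own.
object ProductFeatures {

  def findProductFeatures(productsRatingsDF: DataFrame,
                          productsMetadataDF: DataFrame): DataFrame = {
    // Collapse the ratings to one row per asin, holding the mean of "overall".
    val averageRatings = productsRatingsDF
      .groupBy("asin")
      .agg(avg("overall").as("averageRating"))

    // Inner join on the shared column name; joining by name keeps a single "asin" column.
    averageRatings
      .join(productsMetadataDF, "asin")
      .select(col("asin"), col("categories"), col("price"), col("averageRating"))
  }

  def main(args: Array[String]): Unit = {
    // Local session and sample rows are illustrative assumptions, not part of the question.
    val spark = SparkSession.builder()
      .appName("ProductFeatures")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ratings = Seq(
      ("0006428320", 4.0),
      ("0006428320", 5.0)
    ).toDF("asin", "overall")

    val metadata = Seq(
      ("0006428320", Seq(Seq("Musical Instruments", "Instrument Accessories")), 17.95)
    ).toDF("asin", "categories", "price")

    findProductFeatures(ratings, metadata).show(truncate = false)

    spark.stop()
  }
}
```

Because the join is an inner join, products that have metadata but no ratings (or ratings but no metadata) are dropped. If those rows should be kept, a join type such as "left" can be passed as a third argument, for example `.join(productsMetadataDF, Seq("asin"), "left")`.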
