
Spark SQL DataFrame join on one field

I am very new to Spark and have the following question.

I have 2 tables: Business and Inspections. The Business table has the fields Business_id, name, and address; the Inspections table has score. I want to calculate the top 10 scores, so I need to join the tables on the Business_id field. I tried 2 ways, but neither of them works: 1) using sqlContext.sql (I wrote a SQL query), and 2) using a DataFrame join.

1)sqlContext.sql("""select CBusinesses.BUSINESS_ID,CBusinesses.name,  CBusinesses.address, CBusinesses.city, CBusinesses.postal_code, CBusinesses.latitude, CBusinesses.longitude, Inspections_notnull.score  from CBusinesses, Inspections_notnull where CBusinesses.BUSINESS_ID=Inspections_notnull.BUSINESS_ID and Inspections_notnull.score <>0 order by Inspections_notnull.score""").show()

2) val df = businessesDF.join(raw_inspectionsDF, businessesDF.col("BUSINESS_ID") == raw_inspectionsDF.col("BUSINESS_ID"))

How should I write it? Thanks!

val df = businessesDF.join(raw_inspectionsDF, businessesDF("BUSINESS_ID") === raw_inspectionsDF("BUSINESS_ID"))

This should work. See more details here: https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/sql/DataFrame.html
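To make the difference concrete: in Scala, `==` is ordinary equality, so comparing two `Column` objects yields a `Boolean`, which `join` does not accept as a condition. `===` is a method on `Column` that builds a join expression. Below is a minimal sketch of the join followed by the top 10 scores. It assumes Spark 2.x (for `SparkSession` and `=!=`), and the object name `JoinTopScores`, the helper `topScores`, and the sample rows are hypothetical, made up for illustration rather than taken from the real files.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.desc

object JoinTopScores {
  // Hypothetical helper: join on business_id, drop zero scores, keep the n highest.
  def topScores(businesses: DataFrame, inspections: DataFrame, n: Int): DataFrame =
    businesses
      // === builds a Column expression; == would only compare the two Column objects
      .join(inspections, businesses("business_id") === inspections("business_id"))
      .filter(inspections("score") =!= 0)
      .orderBy(desc("score"))
      .select(businesses("business_id"), businesses("name"), inspections("score"))
      .limit(n)

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in a notebook a session usually already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-top-scores")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the tab-separated input files.
    val businessesDF = Seq(
      (10, "Tiramisu Kitchen"),
      (17, "GEORGE'S COFFEE SHOP")
    ).toDF("business_id", "name")

    val inspectionsDF = Seq(
      (10, 94), (10, 0), (10, 92), (17, 88)
    ).toDF("business_id", "score")

    topScores(businessesDF, inspectionsDF, 10).show()
    spark.stop()
  }
}
```

If the duplicate business_id column after the join gets in the way, `businesses.join(inspections, Seq("business_id"))` is an alternative that keeps a single key column.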

Sure... I created a case class for each dataset, split each line by tab, and then converted the RDD to a DataFrame.

import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.util.{Try, Success, Failure}

// Returns None when the score column is empty or not a number
def parseScore(s: String): Option[Int] = {
  Try(s.toInt) match {
    case Success(x) => Some(x)
    case Failure(_) => None
  }
}

case class CInspections (business_id:Int, score:Option[Int], date:String, type1:String)
val baseDir = "/FileStore/tables/484qrxx21488929011080/"
val raw_inspections = sc.textFile (s"$baseDir/inspections_plus.txt")
val raw_inspectionsmap = raw_inspections.map(line => line.split("\t"))
val raw_inspectionsRDD = raw_inspectionsmap.map(fields =>
  CInspections(fields(0).toInt, parseScore(fields(1)), fields(2), fields(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView ("Inspections")
raw_inspectionsDF.printSchema
//raw_inspectionsDF.show()
val raw_inspectionsDF_replacenull = raw_inspectionsDF.na.fill(0)     //  Replacing null values with '0'
raw_inspectionsDF_replacenull.show()
raw_inspectionsDF_replacenull.createOrReplaceTempView("Inspections_notnull")


For Business -->
case class CBusinesses (business_id:Int, name:String, address:String, city:String, postal_code:Int, latitude:String, longitude:String, phone_number:String, tax_code:String, business_certificate:String, application_date:String, owner_name:String, owner_address:String, owner_city:String, owner_state:String, owner_zip:String)
val businesses = sc.textFile (s"$baseDir/businesses_plus.txt")
val businessesmap = businesses.map ( line => line.split ("\t"))
val businessesRDD = businessesmap.map(fields =>
  CBusinesses(fields(0).toInt, fields(1), fields(2), fields(3), fields(4).toInt,
    fields(5), fields(6), fields(7), fields(8), fields(9), fields(10),
    fields(11), fields(12), fields(13), fields(14), fields(15)))
val businessesDF = businessesRDD.toDF
businessesDF.createOrReplaceTempView("CBusinesses")
businessesDF.printSchema
//businessesDF.show()

It shows the proper result for both DataFrames.
For Inspection -->
+-----------+-----+--------+--------------------+
|business_id|score|    date|               type1|
+-----------+-----+--------+--------------------+
|         10|    0|20140807|Reinspection/Foll...|
|         10|   94|20140729|Routine - Unsched...|
|         10|    0|20140124|Reinspection/Foll...|
|         10|   92|20140114|Routine - Unsched...|

For Business -->
+-----------+--------------------+--------------------+-------------+-----------+---------+-----------+------------+--------+--------------------+----------------+--------------------+--------------------+-----------------+-------------+---------+
|business_id|                name|             address|         city|postal_code| latitude|  longitude|phone_number|tax_code|business_certificate|application_date|          owner_name|       owner_address|       owner_city|  owner_state|owner_zip|
+-----------+--------------------+--------------------+-------------+-----------+---------+-----------+------------+--------+--------------------+----------------+--------------------+--------------------+-----------------+-------------+---------+
|         10|    Tiramisu Kitchen|       033 Belden Pl|San Francisco|      94104|37.791116|-122.403816|            |     H24|              779059|                |        Tiramisu LLC|        33 Belden St|    San Francisco|           CA|    94104|
|         17|GEORGE'S COFFEE SHOP|   2200 OAKDALE Ave |         S.F.|      94124|37.741086|-122.401737| 14155531470|     H24|               78443|          4/5/75|"LIEUW, VICTOR & ...| 648 MACARTHUR DRIVE|        DALY CITY|           CA|    94015|

