
Spark: convert a single DataFrame into two tables with IDs in a one-to-many relationship

The data source is a csv:

name,companyName
shop1,com1
shop2,com1
shop3,com1
shop4,com2
shop5,com2
shop6,com3

I use Spark to read it into a DataFrame and want to convert it into two DataFrames with a one-to-many relationship. The expected output is two DataFrames. One is companyDF:

companyId,companyName
1,com1
2,com2
3,com3

The other is shopDF:

shopId,shopName,shopCompanyId
1,shop1,1
2,shop2,1
3,shop3,1
4,shop4,2
5,shop5,2
6,shop6,3

These two DataFrames can be joined on shopDF.shopCompanyId = companyDF.companyId to get the original data back. I used monotonically_increasing_id() to generate IDs like 1, 2, 3, 4, ... Is there a better way?
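For instance, the original data can be recovered with a join like the following (a minimal sketch, using the companyDF and shopDF names from the expected output above):

// get the original name/companyName pairs back by joining the two tables
val original = shopDF
  .join(companyDF, shopDF("shopCompanyId") === companyDF("companyId"))
  .select("shopName", "companyName")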

I have written some code to do it, and it works:

package delme

import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

object OneToMany {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]")
    val builder = SparkSession.builder().config(sparkConf) //.enableHiveSupport()
    val ss = builder.getOrCreate()

    var shopdf = ss.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("data/shops.csv")

    // one row per company, with a generated surrogate id
    val companydf = shopdf.select("companyName").distinct()
      .withColumn("companyId", monotonically_increasing_id())
    companydf.show()

    // attach the generated company id back to every shop row
    shopdf = shopdf.join(companydf, Seq("companyName"), "left_outer")
      .select("name", "companyName", "companyId")
    shopdf.show()
  }
}

But I feel this is clumsy. I want to handle it in one pass, not with a distinct plus a join: first, joining on a String column may be inefficient; second, I don't know how to handle incremental data. For example, another batch of data is:

name,companyName
shop1a,com1
shop2a,com1
shop3a,com1
shop4a,com2
shop5a,com2
shop6a,com3
shop7,com4

I want to append these rows to the old tables (in fact I save the data to Hive tables), but I have no idea how. Here the new company com4 should be appended to the company table with id 4, and (13, shop7, 4) should be appended to the shop table.

So how can I convert the source into the two DataFrames?

Use dense_rank and row_number over windows to generate the ids:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df1 = df.withColumn("companyId", dense_rank().over(Window.orderBy("companyName")))
            .withColumn("shopId", row_number().over(Window.orderBy("name")))
val companydf = df1.select("companyName", "companyId").dropDuplicates()
val shopdf = df1.select(col("name").alias("shopName"), col("shopId"), col("companyId").alias("shopCompanyId"))
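Note that a window with no partitioning (Window.orderBy(...)) pulls all rows into a single partition, which Spark logs a warning about; for data of this size it is usually fine. For the incremental batch described in the question, one possible approach is sketched below. It assumes the existing tables are already saved in Hive as company and shop with the schemas shown above, and the table and file names are only illustrative: read the current maximum ids and offset the ranks of the new rows by them.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// existing tables (assumed to already exist in Hive)
val companyOld = ss.table("company")   // companyId, companyName
val shopOld    = ss.table("shop")      // shopId, shopName, shopCompanyId

// current maximum ids; 0 if the tables are empty
val maxCompanyId = companyOld.agg(coalesce(max("companyId"), lit(0L))).first().getLong(0)
val maxShopId    = shopOld.agg(coalesce(max("shopId"), lit(0L))).first().getLong(0)

// the new batch, e.g. the second csv from the question
val batch = ss.read.option("header", "true").csv("data/shops_batch2.csv")

// companies in the batch that are not in the company table yet get new, consecutive ids
val newCompanies = batch.select("companyName").distinct()
  .join(companyOld, Seq("companyName"), "left_anti")
  .withColumn("companyId", dense_rank().over(Window.orderBy("companyName")) + maxCompanyId)

val companyAll = companyOld.unionByName(newCompanies.select("companyId", "companyName"))

// resolve each new shop row against the full company table and assign new shop ids
val newShops = batch
  .join(companyAll, Seq("companyName"), "left_outer")
  .withColumn("shopId", row_number().over(Window.orderBy("name")) + maxShopId)
  .select(col("shopId"), col("name").alias("shopName"), col("companyId").alias("shopCompanyId"))

// append the new rows (the Hive write itself is omitted here)
// newCompanies.select("companyId", "companyName").write.mode("append").saveAsTable("company")
// newShops.write.mode("append").saveAsTable("shop")

With the second batch from the question, com4 would get companyId 4 (maxCompanyId 3 plus rank 1) and shop7 would get shopId 13 (maxShopId 6 plus row number 7), which matches the expected result.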
