
Joining multiple tables into one de-normalized table

I'm looking to join multiple tables to get a resultant single de-normalized table. Below is one such scenario where I have 2 tables and an expected resultant table.

Table 1:

id       From Date                  To Date                  User
AA12345  02-Jan-2017 12:00:00 AM    08-Jan-2017 11:59:59 PM  LL7R
AA12345  09-Jan-2017 12:00:00 AM    14-Feb-2017 11:59:59 PM  AT3B
AA12345  15-Feb-2017 12:00:00 AM    31-Dec-3030 11:59:59 PM  UJ5G

Table 2:

id       From Date                  To Date                  Associated id
AA12345  06-Jan-2017 12:00:00 AM    23-Jan-2017 11:59:59 AM  AA12345, AA234567
AA12345  24-Jan-2017 12:00:00 AM    31-Dec-3030 11:59:59 PM  AA12345, AA234567, AB56789

Notice that the id values are the same in both tables. Think of these as event tables, where different events happen over various time periods. The resulting table should contain all the events without any overlap between the From and To dates. When the 'From Date' and 'To Date' ranges overlap, as in this example (the 'To Date' of Table 1's first record is later than the 'From Date' of Table 2's first record), the result table's 'To Date' is set to the nearest next 'From Date' minus one second (here, 06-Jan-2017 12:00:00 AM minus 1 second).

Result:

Dnorm    From Date                  To Date                  User   Associated id
AA12345  02-Jan-2017 12:00:00 AM    05-Jan-2017 11:59:59 PM  LL7R   
AA12345  06-Jan-2017 12:00:00 AM    08-Jan-2017 11:59:59 PM  LL7R   AA12345, AA234567
AA12345  09-Jan-2017 12:00:00 AM    23-Jan-2017 11:59:59 AM  AT3B   AA12345, AA234567
AA12345  24-Jan-2017 12:00:00 AM    14-Feb-2017 11:59:59 PM  AT3B   AA12345, AA234567, AB56789
AA12345  15-Feb-2017 12:00:00 AM    31-Dec-3030 11:59:59 PM  UJ5G   AA12345, AA234567, AB56789

How do we achieve this effectively?

What you want is called an outer join. There are several types of this operation, depending on which table takes priority when the column values don't match.

In this example we have two tables.

Table1

+------+--------------------+--------------------+----+
|    id|           From Date|             To Date|User|
+------+--------------------+--------------------+----+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R|
|AA1112|09-Jan-2017 12:00...|14-Feb-2017 11:59...|AT3B|
|AA1113|15-Feb-2017 12:00...|31-Dec-3030 11:59...|UJ5G|
+------+--------------------+--------------------+----+

Table2

+------+--------------------+--------------------+--------------------+
|    id|           From Date|             To Date|       Associated id|
+------+--------------------+--------------------+--------------------+
|AA1111|03-Jan-2017 12:00...|08-Jan-2017 11:59...|           [AA12345]|
|AA1112|10-Jan-2017 12:00...|14-Feb-2017 11:59...|           [AA12345]|
|AA1113|16-Feb-2017 12:00...|30-Dec-3030 11:59...|           [AA12345]|
|AA1114|24-Jan-2017 12:00...|31-Dec-3030 11:59...|[AA12345, AA23456...|
+------+--------------------+--------------------+--------------------+

Note that the first row in Table 2 not only has the same id as the first row in Table 1 but also the same From Date and To Date values. The second row, on the other hand, has the same id and To Date but a different From Date. The third row only shares the id, and the fourth row is completely different. For the sake of simplicity, we'll assume these combinations cover all the variations in your data.

Now to the different types of joins

Full outer join

A full outer join will simply create additional rows whenever all three join columns are not exactly the same. It can effectively duplicate ids, so use it with caution.

val dfFullOuter =
    table1
    .join( table2, Seq( "id", "From Date", "To Date" ), "outer" )

Result

+------+--------------------+--------------------+----+--------------------+
|    id|           From Date|             To Date|User|       Associated id|
+------+--------------------+--------------------+----+--------------------+
|AA1112|09-Jan-2017 12:00...|14-Feb-2017 11:59...|AT3B|                null|
|AA1113|15-Feb-2017 12:00...|31-Dec-3030 11:59...|UJ5G|                null|
|AA1114|24-Jan-2017 12:00...|31-Dec-3030 11:59...|null|[AA12345, AA23456...|
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R|           [AA12345]|
|AA1113|16-Feb-2017 12:00...|30-Dec-3030 11:59...|null|           [AA12345]|
|AA1112|10-Jan-2017 12:00...|14-Feb-2017 11:59...|null|           [AA12345]|
+------+--------------------+--------------------+----+--------------------+

As you can see, the row with id AA1111 is merged successfully because there are no conflicting values. The other rows are just copied. This method is recommended only if you are absolutely confident that the values in the From Date and To Date columns will be the same for rows with the same id.

You can also join by id only and then decide which table should take priority. In this example, priority is given to Table 2:

val dfFullOuterManual =
    table1
    .join( table2, Seq( "id" ), "outer" )
    .drop( table1( "From Date" ) )
    .drop( table1( "To Date" ) )

Result

+------+----+--------------------+--------------------+--------------------+
|    id|User|           From Date|             To Date|       Associated id|
+------+----+--------------------+--------------------+--------------------+
|AA1112|AT3B|10-Jan-2017 12:00...|14-Feb-2017 11:59...|           [AA12345]|
|AA1111|LL7R|02-Jan-2017 12:00...|08-Jan-2017 11:59...|           [AA12345]|
|AA1114|null|24-Jan-2017 12:00...|31-Dec-3030 11:59...|[AA12345, AA23456...|
|AA1113|UJ5G|16-Feb-2017 12:00...|30-Dec-3030 11:59...|           [AA12345]|
+------+----+--------------------+--------------------+--------------------+

Left outer join

A left outer join gives priority to the values in Table 1: even when only one value conflicts, it keeps all the values from that table. Note that the Associated id values for conflicting rows are null because there is no such column in Table 1. Also, the row with id AA1114 is not copied at all.

val dfLeftOuter =
    table1
    .join( table2, Seq( "id", "From Date", "To Date" ), "left_outer" )

Result

+------+--------------------+--------------------+----+-------------+
|    id|           From Date|             To Date|User|Associated id|
+------+--------------------+--------------------+----+-------------+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R|    [AA12345]|
|AA1112|09-Jan-2017 12:00...|14-Feb-2017 11:59...|AT3B|         null|
|AA1113|15-Feb-2017 12:00...|31-Dec-3030 11:59...|UJ5G|         null|
+------+--------------------+--------------------+----+-------------+

We have resolved the conflict in the From Date and To Date columns, and now it's time to fill in the missing Associated id values. To do that, we join the previous result with the relevant columns selected from Table 2:

val dfLeftOuterFinal =
    dfLeftOuter
    .join( table2.select( "id", "Associated id" ) , Seq( "id" ) )
    .drop( dfLeftOuter( "Associated id" ) )

Note that dropping the original Associated id column is necessary: it comes from the first join and is mostly null.

Final result

+------+--------------------+--------------------+----+-------------+
|    id|           From Date|             To Date|User|Associated id|
+------+--------------------+--------------------+----+-------------+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R|    [AA12345]|
|AA1112|09-Jan-2017 12:00...|14-Feb-2017 11:59...|AT3B|    [AA12345]|
|AA1113|15-Feb-2017 12:00...|31-Dec-3030 11:59...|UJ5G|    [AA12345]|
+------+--------------------+--------------------+----+-------------+

Right outer join

A right outer join gives priority to the data in Table 2 and adds the completely different row (AA1114) to the resulting table. Note that the User values for conflicting rows are null because there is no such column in Table 2.

val dfRightOuter =
    table1
    .join( table2, Seq( "id", "From Date", "To Date" ), "right_outer" )

Result

+------+--------------------+--------------------+----+--------------------+
|    id|           From Date|             To Date|User|       Associated id|
+------+--------------------+--------------------+----+--------------------+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R|           [AA12345]|
|AA1112|10-Jan-2017 12:00...|14-Feb-2017 11:59...|null|           [AA12345]|
|AA1113|16-Feb-2017 12:00...|30-Dec-3030 11:59...|null|           [AA12345]|
|AA1114|24-Jan-2017 12:00...|31-Dec-3030 11:59...|null|[AA12345, AA23456...|
+------+--------------------+--------------------+----+--------------------+

As with the left outer join, we have to retrieve the missing values; this time it's User:

val dfRightOuterFinal =
    dfRightOuter
    .join( table1.select( "id", "User" ) , Seq( "id" ) )
    .drop( dfRightOuter( "User" ) )

Final result

+------+--------------------+--------------------+-------------+----+
|    id|           From Date|             To Date|Associated id|User|
+------+--------------------+--------------------+-------------+----+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|    [AA12345]|LL7R|
|AA1112|10-Jan-2017 12:00...|14-Feb-2017 11:59...|    [AA12345]|AT3B|
|AA1113|16-Feb-2017 12:00...|30-Dec-3030 11:59...|    [AA12345]|UJ5G|
+------+--------------------+--------------------+-------------+----+

Note that the row with id AA1114 is gone: the final join defaults to an inner join, and Table 1 has no row with that id to supply a User value.
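If you would rather keep AA1114 (with a null User), one option, shown here as a sketch rather than part of the original answer, is to make the final lookup a left outer join as well:

```scala
// Keep rows whose id has no match in Table 1; their User stays null.
val dfRightOuterKeepAll =
    dfRightOuter
    .join( table1.select( "id", "User" ), Seq( "id" ), "left_outer" )
    .drop( dfRightOuter( "User" ) )
```

This preserves every row of the right-outer result, at the cost of null User values for ids that exist only in Table 2.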

Final thoughts

Depending on which data should take priority, you can play with these combinations for other columns as well. As you can see, these join types also let you handle gaps in the data according to your intentions.

My full test bench code

import org.apache.spark._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Main {

    def main( args: Array[ String ] ): Unit = {

        val spark =
            SparkSession
            .builder()
            .appName( "SO" )
            .master( "local[*]" )
            .config( "spark.driver.host", "localhost" )
            .getOrCreate()

        import spark.implicits._

        val table1Data = Seq(
            ( "AA1111", "02-Jan-2017 12:00:00 AM", "08-Jan-2017 11:59:59 PM", "LL7R" ),
            ( "AA1112", "09-Jan-2017 12:00:00 AM", "14-Feb-2017 11:59:59 PM", "AT3B" ),
            ( "AA1113", "15-Feb-2017 12:00:00 AM", "31-Dec-3030 11:59:59 PM", "UJ5G" )
        )

        val table1 =
            table1Data
            .toDF( "id", "From Date", "To Date", "User" )

        val table2Data = Seq(
            ( "AA1111", "02-Jan-2017 12:00:00 AM", "08-Jan-2017 11:59:59 PM", Seq( "AA12345" ) ),
            ( "AA1112", "10-Jan-2017 12:00:00 AM", "14-Feb-2017 11:59:59 PM", Seq( "AA12345" ) ),
            ( "AA1113", "16-Feb-2017 12:00:00 AM", "30-Dec-3030 11:59:59 PM", Seq( "AA12345" ) ),
            ( "AA1114", "24-Jan-2017 12:00:00 AM", "31-Dec-3030 11:59:59 PM", Seq( "AA12345", "AA234567", "AB56789" ) )
        )

        val table2 =
            table2Data
            .toDF( "id", "From Date", "To Date", "Associated id" )

        val dfFullOuter =
            table1
            .join( table2, Seq( "id", "From Date", "To Date" ), "outer" )

        dfFullOuter.show()

        val dfFullOuterManual =
            table1
            .join( table2, Seq( "id" ), "outer" )
            .drop( table1( "From Date" ) )
            .drop( table1( "To Date" ) )

        dfFullOuterManual.show()

        val dfLeftOuter =
            table1
            .join( table2, Seq( "id", "From Date", "To Date" ), "left_outer" )

        dfLeftOuter.show()

        val dfLeftOuterFinal =
            dfLeftOuter
            .join( table2.select( "id", "Associated id" ) , Seq( "id" ) )
            .drop( dfLeftOuter( "Associated id" ) )

        dfLeftOuterFinal.show()

        val dfRightOuter =
            table1
            .join( table2, Seq( "id", "From Date", "To Date" ), "right_outer" )

        dfRightOuter.show()

        val dfRightOuterFinal =
            dfRightOuter
            .join( table1.select( "id", "User" ) , Seq( "id" ) )
            .drop( dfRightOuter( "User" ) )

        dfRightOuterFinal.show()
        spark.stop()
    }
}
