
How to perform column-level validation by joining one big DataFrame to many small DataFrames in Spark

I have one big table (DataFrame) with more than 50 million records and 135 columns. For each row, I need to validate more than 50 columns.

So basically, for each row and each column, I need to look up the corresponding value in all 25 tables.

I have listed only 4 small tables here, but in my case there will be 25 such tables.

For example, here is one of my validations, called CityId Validation.

To do the CityId validation, we need the TownCode from Table2, looked up by passing physicalstateorprovincecode, physicalcountrycode and physicalcityname from Table1.

With the TownCode, I have to go to Table3, pass physicalcountrycode, physicalstateorprovincecode and TownCode, and get the CityID.

If a CityID is available, the value is correct; otherwise it is false.

Here is what my data frames look like.

The logic above is an example for one column, but I have to do such validation for more than 50 columns.

Can we do this in Spark?

Table1: main table (50 million records)

+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
|filler1|dunsnumber|businessname                               |tradestylename              |registeredaddressindicator|physicalstreetaddress    |physicalstreetaddress2|physicalcityname|physicalstateorprovincename|physicalcountryname|physicalcitycode|physicalcountycode|physicalstateorprovincecode|physicalstateorprovinceabbreviation|physicalcountrycode|physicalpostalcode|physicalcontinentcode|mailingaddress|mailingcityname|mailingcountyname|mailingstateorprovincename|mailingcountryname|mailingcitycode|mailingcountycode|mailingstateorprovincecode|mailingstateorprovinceabbreviation|mailingcountrycode|mailingpostalcode|mailingcontinentcode|nationalidentificationnumber|nationalidentificationsystemcode|countrytelephoneaccesscode|telephonenumber|cabletelex|faxnumber |chiefexecutiveofficername|chiefexecutiveofficertitle|lineofbusiness                           |sic1|sic2|sic3|sic4|sic5|sic6|primarylocalactivitycode|activityindicator|yearstarted|annualsaleslocal  |annualsalesindicator|annualsalesinusd|currencycode|employeeshere|employeeshereindicator|employeestotal|employeestotalindicator|includeprinciplesindicator|importexportagentindicator|legalstatus|filler2|statuscode|subsidiarycode|filler3|previousdunsnumber|financialstatementdate|filler4|headquarterorparentdunsnumber|headquarterorparentbusinessname            |headquarterorparentstreetaddress|headquarterorparentcityname|headquarterorparentstateorprovincename|headquarterorparentcountryname|headquarterorparentcitycode|headquarterorparentcountycode|headquarterorparentstateorprovinceabbreviation|headquarterorparentcountrycode|headquarterorparentpostalcode|headquarterorparentcontinentcode|filler5|domesticultimatedunsnumbers|domesticultimatebusinessname          
|domesticultimatephysicalstreetaddress|domesticultimatecityname|domesticultimatestateorprovincename|domesticultimatecitycode|domesticultimatecountrycode|domesticultimatestateorprovinceabbreviation|domesticultimatepostalcode|globalultimateindicator|filler6|globalultimatedunsnumber|globalultimatebusinessname            |globalultimatestreetaddress           |globalultimatecityname|globalultimatestateorprovincename|globalultimatecountryname|globalultimatecitycode|globalultimatecountycode|globalultimatestateorprovinceabbreviation|globalultimatecountrycode|globalultimatepostalcode|globalultimatecontinentcode|numberoffamilymembers|diascode |hierarchycode|filler7|filler8|urldomain               |naics1|naics2|naics3|naics4|naics5|naics6|publicprivateindicator|obindicator|latitude  |longitude  |oporactdescpart1                                                                                                                                                                                                                         |oporactdescpart2|oporactdescpart3|oporactdescpart4|oporactdescpart5|nixieindicator|delistindicator|primary8digitsic|primary8digitdescription                                    |primarynaicsdescription                                        |natlidfull|transactionalindicator|
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
|       |001007108 |DOLGENCORP, LLC                            |DOLLAR GENERAL              |N                         |1342 PINE ST             |                      |UNADILLA        |GEORGIA                    |USA                |008857          |296               |019                        |GA                                 |805                |31091             |6                    |              |               |                 |                          |                  |               |000              |000                       |                                  |000               |                 |                    |                            |                                |0001                      |4786279585     |          |          |EVE MEADOWS              |MANAGER                   |VARIETY STORES                           |5331|    |    |    |    |    |                        |000              |0000       |000000000000000000|                    |000000000000000 |            |0000006      |1                     |              |                       |Y                         |G                         |000        |       |2         |0             |       |000000000         |00000000              |       |068331990                    |DOLGENCORP, LLC                            |100 MISSION RDG                 |GOODLETTSVILLE             |TENNESSEE                             |USA                           |003754                     |203                          |TN                                            |805                           |370722171                    |6                               |       |006946172                  |DOLLAR GENERAL CORPORATION            |100 MISSION RDG                      |GOODLETTSVILLE          |TENNESSEE                          |003754                  |805                        |TN                                         |370722171                 |N                      |       |006946172  
             |DOLLAR GENERAL CORPORATION            |100 MISSION RDG                       |GOODLETTSVILLE        |TENNESSEE                        |USA                      |003754                |203                     |TN                                       |805                      |370722171               |6                          |11210                |005479269|02           |       |       |                        |452319|      |      |      |      |      |                      |N          |+32.252708|-083.740074|                                                                                                                                                                                                                                         |                |                |                |                |N             |N              |53310000        |VARIETY STORES                                              |ALL OTHER GENERAL MERCHANDISE STORES                           |          |C                     |
|       |001132690 |ADVANCE AMERICA, CASH ADVANCE CENTERS, INC.|ADVANCE AMERICA             |N                         |332 N L ROGERS WELLS BLVD|                      |GLASGOW         |KENTUCKY                   |USA                |003211          |060               |033                        |KY                                 |805                |421411300         |6                    |              |               |                 |                          |                  |               |000              |000                       |                                  |000               |                 |                    |                            |                                |0001                      |2706511990     |          |          |LISA BROWN               |MANAGER                   |PERSONAL CREDIT INSTITUTIONS             |6141|    |    |    |    |    |                        |000              |0000       |000000000000000000|                    |000000000000000 |            |0000002      |0                     |              |                       |Y                         |G                         |000        |       |2         |0             |       |000000000         |00000000              |       |179469978                    |ADVANCE AMERICA, CASH ADVANCE CENTERS, INC.|135 N CHURCH ST                 |SPARTANBURG                |SOUTH CAROLINA                        |USA                           |008468                     |839                          |SC                                            |805                           |293065138                    |6                               |       |078454395                  |EAGLE U.S. SUB, INC.                  
|135 N CHURCH ST                      |SPARTANBURG             |SOUTH CAROLINA                     |008468                  |805                        |SC                                         |293065138                 |N                      |       |811589639               |GRUPO ELEKTRA, S.A.B. DE C.V.         |AV. FERROCARRIL DE RIO FRIO NO. 419 CJ|CIUDAD DE MEXICO      |CIUDAD DE MEXICO                 |MEXICO                   |009100                |000                     |CDMX                                     |489                      |09310                   |5                          |04316                |008037671|03           |       |       |WWW.ADVANCEAMERICA.NET  |522291|      |      |      |      |      |                      |N          |+37.006016|-085.924526|                                                                                                                                                                                                                                         |                |                |                |                |N             |N              |61410000        |PERSONAL CREDIT INSTITUTIONS                                |CONSUMER LENDING                                               |          |C                     |
|       |001134456 |PEOPLE'S UNITED BANK, NATIONAL ASSOCIATION |                            |N                         |126 DANIEL ST            |                      |PORTSMOUTH      |NEW HAMPSHIRE              |USA                |006885          |725               |057                        |NH                                 |805                |038013857         |6                    |              |               |                 |                          |                  |               |000              |000                       |                                  |000               |                 |                    |                            |                                |0001                      |               |          |          |BARBARA CONDA            |MANAGER                   |NATIONAL COMMERCIAL BANKS, NSK           |6021|    |    |    |    |    |                        |000              |0000       |000000000000000000|                    |000000000000000 |            |0000015      |0                     |              |                       |Y                         |G                         |000        |       |2         |0             |       |000000000         |00000000              |       |072147077                    |PEOPLE'S UNITED BANK, NATIONAL ASSOCIATION |850 MAIN ST FL 6                |BRIDGEPORT                 |CONNECTICUT                           |USA                           |000677                     |112                          |CT                                            |805                           |066044917                    |6                               |       |800407673                  |PEOPLE'S UNITED FINANCIAL, INC.       
|850 MAIN ST                          |BRIDGEPORT              |CONNECTICUT                        |000677                  |805                        |CT                                         |066044917                 |N                      |       |800407673               |PEOPLE'S UNITED FINANCIAL, INC.       |850 MAIN ST                           |BRIDGEPORT            |CONNECTICUT                      |USA                      |000677                |112                     |CT                                       |805                      |066044917               |6                          |00583                |014029370|02           |       |       |WWW.BRANCHES.PEOPLES.COM|522110|      |      |      |      |      |                      |N          |+43.077690|-070.755372|                                                                                                                                                                                                                                         |                |                |                |                |P             |N              |60210000        |NATIONAL COMMERCIAL BANKS                                   |COMMERCIAL BANKING                                             |          |C                     |
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+

The reference tables are very small, no more than 10 MB each.

Table2

+------------+------------+------------+-------------+---------+--------------+
|COUNTRY_CODE|COUNTRY_NAME|PROVINCE    |PROVINCE_CODE|TOWN_CODE|TOWN_NAME     |
+------------+------------+------------+-------------+---------+--------------+
|021         |ANDORRA     |null        |000          |000002   |ALDOSA        |
|021         |ANDORRA     |null        |000          |000013   |EL TARTER     |
|033         |ARGENTINA   |BUENOS AIRES|001          |000223   |OLIVOS        |
|033         |ARGENTINA   |BUENOS AIRES|001          |000226   |PABLO PODESTA |
+------------+------------+------------+-------------+---------+--------------+

Table3

+------+--------+-----------+---------+
|CityID|TownCode|CountryCode|StateCode|   
+------+--------+-----------+---------+
|110880|006129  |805        |001      |
|110888|007554  |805        |005      |
|111164|004661  |805        |009      |
|111368|005193  |805        |075      |
+------+--------+-----------+---------+

Table4: Identifier

+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
|IdentifierTypeId|Value |EntityId  |ValueTypeId|EffectiveFrom       |ProviderId|ProviderType|SourceUpdateDate|SourceLink|SourceType|EffectiveToNACode|EffectiveToMinus|EffectiveTo           |EffectiveFromNACode|EffectiveFromPlus|NaCode|IsPrimary|ValueOrder|ValueTypeCode|EntityType|EntityTypeId|SysFrom             |SysFileId           |
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
|320114          |3339  |4294963171|320114     |1/1/1997 12:00:00 AM|null      |null        |null            |null      |null      |NA02             |null            |12/31/9999 12:00:00 AM|null               |null             |null  |False    |1         |Naics        |Industry  |404008      |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114          |333997|4294963154|320114     |1/1/1997 12:00:00 AM|null      |null        |null            |null      |null      |NA02             |null            |12/31/9999 12:00:00 AM|null               |null             |null  |False    |1         |Naics        |Industry  |404008      |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114          |333999|4294963153|320114     |1/1/1997 12:00:00 AM|null      |null        |null            |null      |null      |NA02             |null            |12/31/9999 12:00:00 AM|null               |null             |null  |False    |1         |Naics        |Industry  |404008      |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114          |334   |4294963152|320114     |1/1/1997 12:00:00 AM|null      |null        |null            |null      |null      |NA02             |null            |12/31/9999 12:00:00 AM|null               |null             |null  |False    |1         |Naics        |Industry  |404008      |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+

Yes, you can do it in Spark. There are two approaches:

  1. Broadcast the small tables and then use filter or where on the big table
  2. Do a broadcast join

Here is a basic example of the first approach.

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._

object Main {

  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val sql = new SQLContext(sc)

  def main(args: Array[String]): Unit = {

    sc.setLogLevel("ERROR")
    import sql.implicits._

    // Small DataFrame with the valid data. Tuple columns are named _1 and _2.
    val validDataRdd = sc.parallelize(Seq((1, 2), (2, 3), (3, 4), (10, 20), (20, 31), (30, 40), (100, 200), (200, 300)))
    val validDataDf = sql.createDataFrame(validDataRdd)

    // This is the big DataFrame. Its single column is named _1 on Spark 1.x
    // (value on Spark 2.x and later); the code below only uses positional access.
    val theData = sc.parallelize(1 to 10000).toDF()

    // To broadcast the data, it first needs to be collected to the driver.
    // Instead of broadcasting Array[Row], one can map each Row to a custom
    // case class for more convenient processing.
    val localValidData = validDataDf.collect()
    val broadcastedValidData = sc.broadcast(localValidData)

    // Filtering is easier on RDDs, but it is also possible with DataFrames.
    // Keep only the rows whose first column matches the first column of a valid row.
    theData.rdd.filter(rowBig =>
      broadcastedValidData.value.exists(row => row.getAs[Int](0) == rowBig.getAs[Int](0))
    ).collect().foreach(println)
  }
}
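The same idea can stay in the DataFrame API. The following is a minimal sketch, not the original author's code: it assumes Spark 2.x or later (`SparkSession` instead of `SQLContext`), and the names `BroadcastSetFilter`, `key`, `id` and `is_valid` are made up for illustration. It broadcasts the small table as a `Set`, so each row of the big table costs one hash lookup instead of a scan over the whole array, and it adds a flag column rather than dropping the invalid rows.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object BroadcastSetFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-set-filter")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data mirroring the example above: a small table of valid keys
    // and a big table with one integer column.
    val validKeys = Seq(1, 2, 3, 10, 20, 30, 100, 200).toDF("key")
    val bigDf = (1 to 10000).toDF("id")

    // Collect the small table to the driver and broadcast it as a Set,
    // so membership checks on the executors are hash lookups.
    val keySet: Set[Int] = validKeys.as[Int].collect().toSet
    val keysBc = spark.sparkContext.broadcast(keySet)

    // UDF that reads the broadcast value on the executors.
    val isValid = udf((id: Int) => keysBc.value.contains(id))

    // Flag every row instead of filtering, which is closer to a validation report.
    val validated = bigDf.withColumn("is_valid", isValid(col("id")))
    validated.filter(col("is_valid")).show()

    spark.stop()
  }
}
```

Keeping the flag as a column makes it possible to add one such column per validated field and write the full result out, rather than collecting rows to the driver.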

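EDIT (added broadcast join example):

// Requires: import org.apache.spark.sql.functions.broadcast
// The broadcast() hint ships the small customers table to every executor,
// so the big orders table is joined without being shuffled.
val ordersByCustomer = ordersDataFrame
  .join(broadcast(customersDataFrame), ordersDataFrame("customers_id") === customersDataFrame("id"), "left")

// foreach runs on the executors; on a cluster the output goes to the executor logs.
ordersByCustomer.foreach(customerOrder => {
  println("> " + customerOrder.toString())
})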
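Applied to the CityId validation from the question, the broadcast join approach might look like the sketch below. It is an illustration under stated assumptions, not a tested solution for the real data: the column names are taken from the tables shown in the question, while `CityIdValidation`, `validateCityId`, the `t2_`/`t3_` aliases, the `cityid_valid` column and the sample rows in `main` are hypothetical. It also assumes Spark 2.x or later, and that a lookup key maps to at most one row in each reference table (duplicates are dropped so the big table's row count does not grow).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

object CityIdValidation {

  // Returns table1 with an extra boolean column `cityid_valid`.
  def validateCityId(table1: DataFrame, table2: DataFrame, table3: DataFrame): DataFrame = {
    // Step 1: Table2 gives the town code for (country code, province code, city name).
    val towns = table2.select(
      col("COUNTRY_CODE").as("t2_country"),
      col("PROVINCE_CODE").as("t2_province"),
      col("TOWN_NAME").as("t2_town_name"),
      col("TOWN_CODE").as("t2_town_code")
    ).dropDuplicates("t2_country", "t2_province", "t2_town_name")

    val withTown = table1.join(
      broadcast(towns),
      col("physicalcountrycode") === col("t2_country") &&
        col("physicalstateorprovincecode") === col("t2_province") &&
        col("physicalcityname") === col("t2_town_name"),
      "left"
    )

    // Step 2: Table3 gives the CityID for (country code, state code, town code).
    val cities = table3.select(
      col("CountryCode").as("t3_country"),
      col("StateCode").as("t3_state"),
      col("TownCode").as("t3_town_code"),
      col("CityID").as("t3_city_id")
    ).dropDuplicates("t3_country", "t3_state", "t3_town_code")

    withTown.join(
      broadcast(cities),
      col("physicalcountrycode") === col("t3_country") &&
        col("physicalstateorprovincecode") === col("t3_state") &&
        col("t2_town_code") === col("t3_town_code"),
      "left"
    )
      // The row is valid when a CityID was found; unmatched left-join rows have null here.
      .withColumn("cityid_valid", col("t3_city_id").isNotNull)
      .drop("t2_country", "t2_province", "t2_town_name", "t2_town_code",
        "t3_country", "t3_state", "t3_town_code", "t3_city_id")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cityid-validation")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows, only to exercise the function; the real tables would be loaded from storage.
    val table1 = Seq(
      ("001007108", "805", "019", "UNADILLA"),
      ("000000000", "805", "019", "NOWHERE")
    ).toDF("dunsnumber", "physicalcountrycode", "physicalstateorprovincecode", "physicalcityname")
    val table2 = Seq(("805", "019", "008857", "UNADILLA"))
      .toDF("COUNTRY_CODE", "PROVINCE_CODE", "TOWN_CODE", "TOWN_NAME")
    val table3 = Seq(("999999", "008857", "805", "019"))
      .toDF("CityID", "TownCode", "CountryCode", "StateCode")

    validateCityId(table1, table2, table3).show()
    spark.stop()
  }
}
```

Each of the other validations could follow the same pattern: one or two broadcast left joins against the relevant reference table, then a boolean column, so the 50 million row table is never shuffled. Spark may also broadcast a small table on its own when its estimated size is below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default); the explicit `broadcast` hint simply makes the intent visible.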
