How to perform column-level validation by joining one big DataFrame to many small DataFrames in Spark
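I have a big table (DataFrame) with more than 50 million records and 135 columns. For each row, I need to validate more than 50 columns.
So, basically, for each row I need to fetch the corresponding values from all 25 reference tables.
I have listed only 4 small tables here, but in my case there will be 25 small tables.
For example, this is one of my validations, called CityId validation.
To do the CityId validation, we need to get the TownCode from Table 2 by passing physicalstateorprovincecode, physicalcountrycode, and physicalcityname from Table 1.
With the TownCode, I then have to go to Table 3, pass physicalcountrycode, physicalstateorprovincecode, and the TownCode, and get the CityID.
If a CityID is found, the validation result is true; otherwise it is false.
The logic above is an example for one column, but I have to do this kind of validation for more than 50 columns.
Can we do this in Spark?
This is what my DataFrame looks like.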
Table 1: main table (50 million records)
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
|filler1|dunsnumber|businessname |tradestylename |registeredaddressindicator|physicalstreetaddress |physicalstreetaddress2|physicalcityname|physicalstateorprovincename|physicalcountryname|physicalcitycode|physicalcountycode|physicalstateorprovincecode|physicalstateorprovinceabbreviation|physicalcountrycode|physicalpostalcode|physicalcontinentcode|mailingaddress|mailingcityname|mailingcountyname|mailingstateorprovincename|mailingcountryname|mailingcitycode|mailingcountycode|mailingstateorprovincecode|mailingstateorprovinceabbreviation|mailingcountrycode|mailingpostalcode|mailingcontinentcode|nationalidentificationnumber|nationalidentificationsystemcode|countrytelephoneaccesscode|telephonenumber|cabletelex|faxnumber |chiefexecutiveofficername|chiefexecutiveofficertitle|lineofbusiness |sic1|sic2|sic3|sic4|sic5|sic6|primarylocalactivitycode|activityindicator|yearstarted|annualsaleslocal |annualsalesindicator|annualsalesinusd|currencycode|employeeshere|employeeshereindicator|employeestotal|employeestotalindicator|includeprinciplesindicator|importexportagentindicator|legalstatus|filler2|statuscode|subsidiarycode|filler3|previousdunsnumber|financialstatementdate|filler4|headquarterorparentdunsnumber|headquarterorparentbusinessname |headquarterorparentstreetaddress|headquarterorparentcityname|headquarterorparentstateorprovincename|headquarterorparentcountryname|headquarterorparentcitycode|headquarterorparentcountycode|headquarterorparentstateorprovinceabbreviation|headquarterorparentcountrycode|headquarterorparentpostalcode|headquarterorparentcontinentcode|filler5|domesticultimatedunsnumbers|domesticultimatebusinessname |domesticultimatephysicalstreetaddress|domesticultimatecityname|domesticultimatestateorprovincename|domesticultimatecitycode|domesticultimatecountrycode|domesticultimatestateorprovinceabbreviation|domesticultimatepostalcode|globalultimateindicator|filler6|globalultimatedunsnumber|globalultimatebusinessname |globalultimatestreetaddress 
|globalultimatecityname|globalultimatestateorprovincename|globalultimatecountryname|globalultimatecitycode|globalultimatecountycode|globalultimatestateorprovinceabbreviation|globalultimatecountrycode|globalultimatepostalcode|globalultimatecontinentcode|numberoffamilymembers|diascode |hierarchycode|filler7|filler8|urldomain |naics1|naics2|naics3|naics4|naics5|naics6|publicprivateindicator|obindicator|latitude |longitude |oporactdescpart1 |oporactdescpart2|oporactdescpart3|oporactdescpart4|oporactdescpart5|nixieindicator|delistindicator|primary8digitsic|primary8digitdescription |primarynaicsdescription |natlidfull|transactionalindicator|
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
| |001007108 |DOLGENCORP, LLC |DOLLAR GENERAL |N |1342 PINE ST | |UNADILLA |GEORGIA |USA |008857 |296 |019 |GA |805 |31091 |6 | | | | | | |000 |000 | |000 | | | | |0001 |4786279585 | | |EVE MEADOWS |MANAGER |VARIETY STORES |5331| | | | | | |000 |0000 |000000000000000000| |000000000000000 | |0000006 |1 | | |Y |G |000 | |2 |0 | |000000000 |00000000 | |068331990 |DOLGENCORP, LLC |100 MISSION RDG |GOODLETTSVILLE |TENNESSEE |USA |003754 |203 |TN |805 |370722171 |6 | |006946172 |DOLLAR GENERAL CORPORATION |100 MISSION RDG |GOODLETTSVILLE |TENNESSEE |003754 |805 |TN |370722171 |N | |006946172 |DOLLAR GENERAL CORPORATION |100 MISSION RDG |GOODLETTSVILLE |TENNESSEE |USA |003754 |203 |TN |805 |370722171 |6 |11210 |005479269|02 | | | |452319| | | | | | |N |+32.252708|-083.740074| | | | | |N |N |53310000 |VARIETY STORES |ALL OTHER GENERAL MERCHANDISE STORES | |C |
| |001132690 |ADVANCE AMERICA, CASH ADVANCE CENTERS, INC.|ADVANCE AMERICA |N |332 N L ROGERS WELLS BLVD| |GLASGOW |KENTUCKY |USA |003211 |060 |033 |KY |805 |421411300 |6 | | | | | | |000 |000 | |000 | | | | |0001 |2706511990 | | |LISA BROWN |MANAGER |PERSONAL CREDIT INSTITUTIONS |6141| | | | | | |000 |0000 |000000000000000000| |000000000000000 | |0000002 |0 | | |Y |G |000 | |2 |0 | |000000000 |00000000 | |179469978 |ADVANCE AMERICA, CASH ADVANCE CENTERS, INC.|135 N CHURCH ST |SPARTANBURG |SOUTH CAROLINA |USA |008468 |839 |SC |805 |293065138 |6 | |078454395 |EAGLE U.S. SUB, INC. |135 N CHURCH ST |SPARTANBURG |SOUTH CAROLINA |008468 |805 |SC |293065138 |N | |811589639 |GRUPO ELEKTRA, S.A.B. DE C.V. |AV. FERROCARRIL DE RIO FRIO NO. 419 CJ|CIUDAD DE MEXICO |CIUDAD DE MEXICO |MEXICO |009100 |000 |CDMX |489 |09310 |5 |04316 |008037671|03 | | |WWW.ADVANCEAMERICA.NET |522291| | | | | | |N |+37.006016|-085.924526| | | | | |N |N |61410000 |PERSONAL CREDIT INSTITUTIONS |CONSUMER LENDING | |C |
| |001134456 |PEOPLE'S UNITED BANK, NATIONAL ASSOCIATION | |N |126 DANIEL ST | |PORTSMOUTH |NEW HAMPSHIRE |USA |006885 |725 |057 |NH |805 |038013857 |6 | | | | | | |000 |000 | |000 | | | | |0001 | | | |BARBARA CONDA |MANAGER |NATIONAL COMMERCIAL BANKS, NSK |6021| | | | | | |000 |0000 |000000000000000000| |000000000000000 | |0000015 |0 | | |Y |G |000 | |2 |0 | |000000000 |00000000 | |072147077 |PEOPLE'S UNITED BANK, NATIONAL ASSOCIATION |850 MAIN ST FL 6 |BRIDGEPORT |CONNECTICUT |USA |000677 |112 |CT |805 |066044917 |6 | |800407673 |PEOPLE'S UNITED FINANCIAL, INC. |850 MAIN ST |BRIDGEPORT |CONNECTICUT |000677 |805 |CT |066044917 |N | |800407673 |PEOPLE'S UNITED FINANCIAL, INC. |850 MAIN ST |BRIDGEPORT |CONNECTICUT |USA |000677 |112 |CT |805 |066044917 |6 |00583 |014029370|02 | | |WWW.BRANCHES.PEOPLES.COM|522110| | | | | | |N |+43.077690|-070.755372| | | | | |P |N |60210000 |NATIONAL COMMERCIAL BANKS |COMMERCIAL BANKING | |C |
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
The reference tables are small, no more than 10 MB each.
Table 2
+------------+------------+------------+-------------+---------+--------------+
|COUNTRY_CODE|COUNTRY_NAME|PROVINCE |PROVINCE_CODE|TOWN_CODE|TOWN_NAME |
+------------+------------+------------+-------------+---------+--------------+
|021 |ANDORRA |null |000 |000002 |ALDOSA |
|021 |ANDORRA |null |000 |000013 |EL TARTER |
|033 |ARGENTINA |BUENOS AIRES|001 |000223 |OLIVOS |
|033 |ARGENTINA |BUENOS AIRES|001 |000226 |PABLO PODESTA |
+------------+------------+------------+-------------+---------+--------------+
Table 3
+------+--------+-----------+---------+
|CityID|TownCode|CountryCode|StateCode|
+------+--------+-----------+---------+
|110880|006129 |805 |001 |
|110888|007554 |805 |005 |
|111164|004661 |805 |009 |
|111368|005193 |805 |075 |
+------+--------+-----------+---------+
Table 4: Identifiers
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
|IdentifierTypeId|Value |EntityId |ValueTypeId|EffectiveFrom |ProviderId|ProviderType|SourceUpdateDate|SourceLink|SourceType|EffectiveToNACode|EffectiveToMinus|EffectiveTo |EffectiveFromNACode|EffectiveFromPlus|NaCode|IsPrimary|ValueOrder|ValueTypeCode|EntityType|EntityTypeId|SysFrom |SysFileId |
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
|320114 |3339 |4294963171|320114 |1/1/1997 12:00:00 AM|null |null |null |null |null |NA02 |null |12/31/9999 12:00:00 AM|null |null |null |False |1 |Naics |Industry |404008 |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114 |333997|4294963154|320114 |1/1/1997 12:00:00 AM|null |null |null |null |null |NA02 |null |12/31/9999 12:00:00 AM|null |null |null |False |1 |Naics |Industry |404008 |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114 |333999|4294963153|320114 |1/1/1997 12:00:00 AM|null |null |null |null |null |NA02 |null |12/31/9999 12:00:00 AM|null |null |null |False |1 |Naics |Industry |404008 |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114 |334 |4294963152|320114 |1/1/1997 12:00:00 AM|null |null |null |null |null |NA02 |null |12/31/9999 12:00:00 AM|null |null |null |False |1 |Naics |Industry |404008 |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
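Yes, you can do this in Spark. There are two approaches:
1. broadcast the small tables, then use filter or where on the big table
2. use a broadcast join
Here is a basic example of the first approach.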
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._

object Main {
  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val sql = new SQLContext(sc)

  def main(args: Array[String]): Unit = {
    sc.setLogLevel("ERROR")
    import sql.implicits._

    // Create a DataFrame with the valid data. Column names will be _1 and _2.
    val validDataRdd = sc.parallelize(Seq((1, 2), (2, 3), (3, 4), (10, 20), (20, 31), (30, 40), (100, 200), (200, 300)))
    val validDataDf = sql.createDataFrame(validDataRdd)

    // This is the big DataFrame. It has a single column, which is read by position below.
    val theData = sc.parallelize(1 to 10000).toDF()

    // To broadcast the data, it first needs to be collected to the driver.
    // Instead of broadcasting Array[Row], one can map each Row to a custom case class for more convenient processing.
    val localValidData = validDataDf.collect()
    val broadcastedValidData = sc.broadcast(localValidData)

    // It is easier to filter on RDDs, but it is also possible to use DataFrames.
    theData.rdd.filter(rowBig =>
      broadcastedValidData.value.exists(row => row.getAs[Int](0) == rowBig.getAs[Int](0))
    ).collect().foreach(println)
  }
}
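For validation you usually want to keep every row and add a true/false flag rather than filter rows out. The following is a minimal sketch of that variation on the first approach, not the original answer's code. It assumes Spark 2.x or later (`SparkSession`), and the object name, the `flagValidStates` helper, the `stateValid` column, and the sample reference data are all hypothetical. The small table is collected into a `Set`, so each lookup is a hash lookup instead of a scan over an array.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object BroadcastLookupValidation {
  // Adds a boolean column `stateValid` (hypothetical name) that is true when the
  // (country code, state code) pair of a row exists in the small reference table.
  def flagValidStates(spark: SparkSession, big: DataFrame, reference: DataFrame): DataFrame = {
    import spark.implicits._

    // Collect the small table to the driver as a Set of keys and broadcast it.
    val validKeys: Set[(String, String)] =
      reference.select("countryCode", "stateCode").as[(String, String)].collect().toSet
    val validKeysBc = spark.sparkContext.broadcast(validKeys)

    // The UDF runs on the executors and reads the broadcast Set.
    val isValid = udf((country: String, state: String) =>
      validKeysBc.value.contains((country, state)))

    big.withColumn("stateValid",
      isValid(col("physicalcountrycode"), col("physicalstateorprovincecode")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-lookup").master("local[*]").getOrCreate()
    import spark.implicits._

    // Made-up sample data, for illustration only.
    val reference = Seq(("805", "019"), ("805", "033")).toDF("countryCode", "stateCode")
    val big = Seq(("001007108", "805", "019"), ("000000001", "805", "999"))
      .toDF("dunsnumber", "physicalcountrycode", "physicalstateorprovincecode")

    flagValidStates(spark, big, reference).show(false)
    spark.stop()
  }
}
```

With 25 reference tables you would broadcast one such `Set` (or `Map`) per table and add one flag column per validation.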
Edit (adding a broadcast join example):
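import org.apache.spark.sql.functions.broadcast

// broadcast(...) hints Spark to ship the small DataFrame to every executor
val ordersByCustomer = ordersDataFrame
  .join(broadcast(customersDataFrame), ordersDataFrame("customers_id") === customersDataFrame("id"), "left")
ordersByCustomer.foreach(customerOrder => {
  println("> " + customerOrder.toString())
})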
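Applied to the CityId validation from the question, the broadcast join approach could look roughly like the sketch below. It assumes Spark 2.x or later, that the column names are the ones shown in the question's tables, and that each lookup key is unique in Table 2 and Table 3 (otherwise the left joins would duplicate rows of the big table). The object name, the `withCityIdValid` helper, the `cityIdValid` column, the `t2_`/`t3_` prefixes, and the sample rows (including the CityID value) are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

object CityIdValidation {
  // Adds a boolean column `cityIdValid` using two broadcast left joins.
  def withCityIdValid(main: DataFrame, table2: DataFrame, table3: DataFrame): DataFrame = {
    // Rename the reference columns so the join conditions are unambiguous.
    val towns = table2.select(
      col("COUNTRY_CODE").as("t2_country"), col("PROVINCE_CODE").as("t2_province"),
      col("TOWN_NAME").as("t2_town"), col("TOWN_CODE").as("t2_townCode"))
    val cities = table3.select(
      col("CountryCode").as("t3_country"), col("StateCode").as("t3_state"),
      col("TownCode").as("t3_townCode"), col("CityID").as("t3_cityId"))

    main
      // Step 1: country code + province code + city name -> TownCode (Table 2)
      .join(broadcast(towns),
        col("physicalcountrycode") === col("t2_country") &&
          col("physicalstateorprovincecode") === col("t2_province") &&
          col("physicalcityname") === col("t2_town"), "left")
      // Step 2: country code + state code + TownCode -> CityID (Table 3)
      .join(broadcast(cities),
        col("physicalcountrycode") === col("t3_country") &&
          col("physicalstateorprovincecode") === col("t3_state") &&
          col("t2_townCode") === col("t3_townCode"), "left")
      // The validation passes only when a CityID was found.
      .withColumn("cityIdValid", col("t3_cityId").isNotNull)
      .drop("t2_country", "t2_province", "t2_town", "t2_townCode",
        "t3_country", "t3_state", "t3_townCode", "t3_cityId")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cityid-validation").master("local[*]").getOrCreate()
    import spark.implicits._

    // Made-up sample rows, for illustration only.
    val mainDf = Seq(("001007108", "805", "019", "UNADILLA"), ("999999999", "805", "019", "NOWHERE"))
      .toDF("dunsnumber", "physicalcountrycode", "physicalstateorprovincecode", "physicalcityname")
    val table2 = Seq(("805", "019", "008857", "UNADILLA"))
      .toDF("COUNTRY_CODE", "PROVINCE_CODE", "TOWN_CODE", "TOWN_NAME")
    val table3 = Seq(("123456", "008857", "805", "019"))
      .toDF("CityID", "TownCode", "CountryCode", "StateCode")

    withCityIdValid(mainDf, table2, table3).show(false)
    spark.stop()
  }
}
```

Each of the other validations can follow the same pattern: one or two broadcast left joins plus one flag column. Since the reference tables are at most about 10 MB, Spark may also pick a broadcast join on its own, depending on `spark.sql.autoBroadcastJoinThreshold` (10 MB by default); the explicit `broadcast` hint just makes the intent clear.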