How to perform column-level validation by joining one big DataFrame to many small DataFrames in Spark
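I have a big table (DataFrame) with more than 50 million records and 135 columns. For each row, I need to validate more than 50 of the columns.
So, basically, for each row I need to fetch the corresponding values from all 25 reference tables.
I have listed only 4 small tables here, but in my case there will be 25 small tables.
For example, here is one of my validations, called CityId validation.
To do the CityId validation, we need to get the TownCode from Table 2 by passing physicalstateorprovincecode, physicalcountrycode and physicalcityname from Table 1.
Using the TownCode, I then have to go to Table 3, passing the physical country code, the physical state or province code and the TownCode, to get the CityID.
If a CityID is found, the validation result is true; otherwise it is false.
The logic above is an example for one column, but I have to do this kind of validation for more than 50 columns.
Can we do this?
This is what my DataFrames look like.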
Table 1: main table (50 million records)
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
|filler1|dunsnumber|businessname |tradestylename |registeredaddressindicator|physicalstreetaddress |physicalstreetaddress2|physicalcityname|physicalstateorprovincename|physicalcountryname|physicalcitycode|physicalcountycode|physicalstateorprovincecode|physicalstateorprovinceabbreviation|physicalcountrycode|physicalpostalcode|physicalcontinentcode|mailingaddress|mailingcityname|mailingcountyname|mailingstateorprovincename|mailingcountryname|mailingcitycode|mailingcountycode|mailingstateorprovincecode|mailingstateorprovinceabbreviation|mailingcountrycode|mailingpostalcode|mailingcontinentcode|nationalidentificationnumber|nationalidentificationsystemcode|countrytelephoneaccesscode|telephonenumber|cabletelex|faxnumber |chiefexecutiveofficername|chiefexecutiveofficertitle|lineofbusiness |sic1|sic2|sic3|sic4|sic5|sic6|primarylocalactivitycode|activityindicator|yearstarted|annualsaleslocal |annualsalesindicator|annualsalesinusd|currencycode|employeeshere|employeeshereindicator|employeestotal|employeestotalindicator|includeprinciplesindicator|importexportagentindicator|legalstatus|filler2|statuscode|subsidiarycode|filler3|previousdunsnumber|financialstatementdate|filler4|headquarterorparentdunsnumber|headquarterorparentbusinessname |headquarterorparentstreetaddress|headquarterorparentcityname|headquarterorparentstateorprovincename|headquarterorparentcountryname|headquarterorparentcitycode|headquarterorparentcountycode|headquarterorparentstateorprovinceabbreviation|headquarterorparentcountrycode|headquarterorparentpostalcode|headquarterorparentcontinentcode|filler5|domesticultimatedunsnumbers|domesticultimatebusinessname |domesticultimatephysicalstreetaddress|domesticultimatecityname|domesticultimatestateorprovincename|domesticultimatecitycode|domesticultimatecountrycode|domesticultimatestateorprovinceabbreviation|domesticultimatepostalcode|globalultimateindicator|filler6|globalultimatedunsnumber|globalultimatebusinessname |globalultimatestreetaddress 
|globalultimatecityname|globalultimatestateorprovincename|globalultimatecountryname|globalultimatecitycode|globalultimatecountycode|globalultimatestateorprovinceabbreviation|globalultimatecountrycode|globalultimatepostalcode|globalultimatecontinentcode|numberoffamilymembers|diascode |hierarchycode|filler7|filler8|urldomain |naics1|naics2|naics3|naics4|naics5|naics6|publicprivateindicator|obindicator|latitude |longitude |oporactdescpart1 |oporactdescpart2|oporactdescpart3|oporactdescpart4|oporactdescpart5|nixieindicator|delistindicator|primary8digitsic|primary8digitdescription |primarynaicsdescription |natlidfull|transactionalindicator|
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
| |001007108 |DOLGENCORP, LLC |DOLLAR GENERAL |N |1342 PINE ST | |UNADILLA |GEORGIA |USA |008857 |296 |019 |GA |805 |31091 |6 | | | | | | |000 |000 | |000 | | | | |0001 |4786279585 | | |EVE MEADOWS |MANAGER |VARIETY STORES |5331| | | | | | |000 |0000 |000000000000000000| |000000000000000 | |0000006 |1 | | |Y |G |000 | |2 |0 | |000000000 |00000000 | |068331990 |DOLGENCORP, LLC |100 MISSION RDG |GOODLETTSVILLE |TENNESSEE |USA |003754 |203 |TN |805 |370722171 |6 | |006946172 |DOLLAR GENERAL CORPORATION |100 MISSION RDG |GOODLETTSVILLE |TENNESSEE |003754 |805 |TN |370722171 |N | |006946172 |DOLLAR GENERAL CORPORATION |100 MISSION RDG |GOODLETTSVILLE |TENNESSEE |USA |003754 |203 |TN |805 |370722171 |6 |11210 |005479269|02 | | | |452319| | | | | | |N |+32.252708|-083.740074| | | | | |N |N |53310000 |VARIETY STORES |ALL OTHER GENERAL MERCHANDISE STORES | |C |
| |001132690 |ADVANCE AMERICA, CASH ADVANCE CENTERS, INC.|ADVANCE AMERICA |N |332 N L ROGERS WELLS BLVD| |GLASGOW |KENTUCKY |USA |003211 |060 |033 |KY |805 |421411300 |6 | | | | | | |000 |000 | |000 | | | | |0001 |2706511990 | | |LISA BROWN |MANAGER |PERSONAL CREDIT INSTITUTIONS |6141| | | | | | |000 |0000 |000000000000000000| |000000000000000 | |0000002 |0 | | |Y |G |000 | |2 |0 | |000000000 |00000000 | |179469978 |ADVANCE AMERICA, CASH ADVANCE CENTERS, INC.|135 N CHURCH ST |SPARTANBURG |SOUTH CAROLINA |USA |008468 |839 |SC |805 |293065138 |6 | |078454395 |EAGLE U.S. SUB, INC. |135 N CHURCH ST |SPARTANBURG |SOUTH CAROLINA |008468 |805 |SC |293065138 |N | |811589639 |GRUPO ELEKTRA, S.A.B. DE C.V. |AV. FERROCARRIL DE RIO FRIO NO. 419 CJ|CIUDAD DE MEXICO |CIUDAD DE MEXICO |MEXICO |009100 |000 |CDMX |489 |09310 |5 |04316 |008037671|03 | | |WWW.ADVANCEAMERICA.NET |522291| | | | | | |N |+37.006016|-085.924526| | | | | |N |N |61410000 |PERSONAL CREDIT INSTITUTIONS |CONSUMER LENDING | |C |
| |001134456 |PEOPLE'S UNITED BANK, NATIONAL ASSOCIATION | |N |126 DANIEL ST | |PORTSMOUTH |NEW HAMPSHIRE |USA |006885 |725 |057 |NH |805 |038013857 |6 | | | | | | |000 |000 | |000 | | | | |0001 | | | |BARBARA CONDA |MANAGER |NATIONAL COMMERCIAL BANKS, NSK |6021| | | | | | |000 |0000 |000000000000000000| |000000000000000 | |0000015 |0 | | |Y |G |000 | |2 |0 | |000000000 |00000000 | |072147077 |PEOPLE'S UNITED BANK, NATIONAL ASSOCIATION |850 MAIN ST FL 6 |BRIDGEPORT |CONNECTICUT |USA |000677 |112 |CT |805 |066044917 |6 | |800407673 |PEOPLE'S UNITED FINANCIAL, INC. |850 MAIN ST |BRIDGEPORT |CONNECTICUT |000677 |805 |CT |066044917 |N | |800407673 |PEOPLE'S UNITED FINANCIAL, INC. |850 MAIN ST |BRIDGEPORT |CONNECTICUT |USA |000677 |112 |CT |805 |066044917 |6 |00583 |014029370|02 | | |WWW.BRANCHES.PEOPLES.COM|522110| | | | | | |N |+43.077690|-070.755372| | | | | |P |N |60210000 |NATIONAL COMMERCIAL BANKS |COMMERCIAL BANKING | |C |
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
The reference tables are small, no more than 10 MB each.
Table 2
+------------+------------+------------+-------------+---------+--------------+
|COUNTRY_CODE|COUNTRY_NAME|PROVINCE |PROVINCE_CODE|TOWN_CODE|TOWN_NAME |
+------------+------------+------------+-------------+---------+--------------+
|021 |ANDORRA |null |000 |000002 |ALDOSA |
|021 |ANDORRA |null |000 |000013 |EL TARTER |
|033 |ARGENTINA |BUENOS AIRES|001 |000223 |OLIVOS |
|033 |ARGENTINA |BUENOS AIRES|001 |000226 |PABLO PODESTA |
+------------+------------+------------+-------------+---------+--------------+
Table 3
+------+--------+-----------+---------+
|CityID|TownCode|CountryCode|StateCode|
+------+--------+-----------+---------+
|110880|006129 |805 |001 |
|110888|007554 |805 |005 |
|111164|004661 |805 |009 |
|111368|005193 |805 |075 |
+------+--------+-----------+---------+
Table 4: Identifiers
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
|IdentifierTypeId|Value |EntityId |ValueTypeId|EffectiveFrom |ProviderId|ProviderType|SourceUpdateDate|SourceLink|SourceType|EffectiveToNACode|EffectiveToMinus|EffectiveTo |EffectiveFromNACode|EffectiveFromPlus|NaCode|IsPrimary|ValueOrder|ValueTypeCode|EntityType|EntityTypeId|SysFrom |SysFileId |
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
|320114 |3339 |4294963171|320114 |1/1/1997 12:00:00 AM|null |null |null |null |null |NA02 |null |12/31/9999 12:00:00 AM|null |null |null |False |1 |Naics |Industry |404008 |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114 |333997|4294963154|320114 |1/1/1997 12:00:00 AM|null |null |null |null |null |NA02 |null |12/31/9999 12:00:00 AM|null |null |null |False |1 |Naics |Industry |404008 |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114 |333999|4294963153|320114 |1/1/1997 12:00:00 AM|null |null |null |null |null |NA02 |null |12/31/9999 12:00:00 AM|null |null |null |False |1 |Naics |Industry |404008 |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114 |334 |4294963152|320114 |1/1/1997 12:00:00 AM|null |null |null |null |null |NA02 |null |12/31/9999 12:00:00 AM|null |null |null |False |1 |Naics |Industry |404008 |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
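Yes, you can do this in Spark. There are two approaches:
1. broadcast the small tables, then use filter or where on the big table
2. broadcast join
Here is a basic example of the first approach.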
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._

object Main {
  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val sql = new SQLContext(sc)

  def main(args: Array[String]): Unit = {
    sc.setLogLevel("ERROR")
    import sql.implicits._

    // Creating a DataFrame with valid data. Column names will be _1 and _2
    val validDataRdd = sc.parallelize(Seq((1, 2), (2, 3), (3, 4), (10, 20), (20, 31), (30, 40), (100, 200), (200, 300)))
    val validDataDf = sql.createDataFrame(validDataRdd)

    // This is the big DataFrame. It has a single column, read by position below
    val theData = sc.parallelize(1 to 10000).toDF()

    // To broadcast the data, it first needs to be collected to the driver.
    // Instead of broadcasting Array[Row], one can transform each Row into
    // a custom case class for more convenient processing.
    val localValidData = validDataDf.collect()
    val broadcastedValidData = sc.broadcast(localValidData)

    // It's easier to do the filtering on RDDs, but it is also possible to use DataFrames.
    // Keep only the rows whose value appears in the first column of the valid data.
    theData.rdd.filter(rowBig =>
      broadcastedValidData.value.exists(row => row.getAs[Int](0) == rowBig.getAs[Int](0))
    ).collect().foreach(println)
  }
}
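The same broadcast idea can be applied to the CityId validation from the question. The following is only a minimal sketch, assuming Spark 2.x or later (for SparkSession) and assuming that the key values in the main table match the reference tables exactly (same case, no padding). The object name, the helper isValidCity, the output column cityid_valid and all sample values, including the CityID 999001, are made up for illustration. It collects Table 2 and Table 3 into maps, broadcasts them, and chains the two lookups inside a UDF.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object CityIdBroadcastLookup {
  // (countryCode, stateCode, townName) -> townCode, built from Table 2
  type TownIndex = Map[(String, String, String), String]
  // (countryCode, stateCode, townCode) -> cityId, built from Table 3
  type CityIndex = Map[(String, String, String), String]

  // Lookup chain: Table 1 keys -> TownCode (Table 2) -> CityID (Table 3).
  // Returns true only if both lookups succeed.
  def isValidCity(towns: TownIndex, cities: CityIndex)(
      country: String, state: String, cityName: String): Boolean =
    towns.get((country, state, cityName))
      .flatMap(townCode => cities.get((country, state, townCode)))
      .isDefined

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cityid-lookup").master("local[*]").getOrCreate()
    import spark.implicits._

    // Tiny stand-ins for the three tables; the values are illustrative only.
    val mainDf = Seq(
      ("001007108", "805", "019", "UNADILLA"),
      ("000000000", "805", "019", "NO SUCH TOWN")
    ).toDF("dunsnumber", "physicalcountrycode",
      "physicalstateorprovincecode", "physicalcityname")
    val table2 = Seq(("805", "019", "008857", "UNADILLA"))
      .toDF("COUNTRY_CODE", "PROVINCE_CODE", "TOWN_CODE", "TOWN_NAME")
    val table3 = Seq(("999001", "008857", "805", "019"))
      .toDF("CityID", "TownCode", "CountryCode", "StateCode")

    // The reference tables are small, so collecting them to the driver is cheap.
    val towns: TownIndex = table2
      .select("COUNTRY_CODE", "PROVINCE_CODE", "TOWN_NAME", "TOWN_CODE")
      .collect()
      .map(r => (r.getString(0), r.getString(1), r.getString(2)) -> r.getString(3))
      .toMap
    val cities: CityIndex = table3
      .select("CountryCode", "StateCode", "TownCode", "CityID")
      .collect()
      .map(r => (r.getString(0), r.getString(1), r.getString(2)) -> r.getString(3))
      .toMap

    // Ship one read-only copy of each map to every executor.
    val townsB = spark.sparkContext.broadcast(towns)
    val citiesB = spark.sparkContext.broadcast(cities)

    val cityIdValid = udf { (country: String, state: String, cityName: String) =>
      isValidCity(townsB.value, citiesB.value)(country, state, cityName)
    }

    // Adds a boolean column instead of filtering, so every row keeps its result.
    mainDf.withColumn("cityid_valid", cityIdValid(
      col("physicalcountrycode"),
      col("physicalstateorprovincecode"),
      col("physicalcityname"))).show(false)

    spark.stop()
  }
}
```

Keeping the lookup chain in a plain function makes it possible to check the rule without a cluster, and the map lookups avoid the linear scan that exists performs over a broadcast array.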
Edit (adding a broadcast join example):
val ordersByCustomer = ordersDataFrame
  .join(broadcast(customersDataFrame), ordersDataFrame("customers_id") === customersDataFrame("id"), "left")

ordersByCustomer.foreach(customerOrder => {
  println("> " + customerOrder.toString())
})
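Applied to the CityId validation from the question, the broadcast join approach might look like the sketch below. It is not a definitive implementation: it assumes Spark 2.x or later, exact matches between the key values of the main table and the reference tables, and the column names shown in the question. The object name, the function validateCityId, the t2_/t3_ prefixes and the output column cityid_valid are hypothetical names chosen for this example.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{broadcast, col}

object CityIdBroadcastJoin {
  // Adds a boolean column "cityid_valid" to the main table:
  // Table 1 keys -> TownCode (Table 2) -> CityID (Table 3).
  def validateCityId(main: DataFrame, table2: DataFrame, table3: DataFrame): DataFrame = {
    // Keep only the needed columns and rename them so they cannot
    // clash with any of the 135 columns of the main table.
    val towns = table2.select(
        col("COUNTRY_CODE").as("t2_country"),
        col("PROVINCE_CODE").as("t2_state"),
        col("TOWN_NAME").as("t2_town_name"),
        col("TOWN_CODE").as("t2_town_code"))
      // Duplicate keys would multiply rows of the main table in a left join;
      // which duplicate survives here is arbitrary.
      .dropDuplicates(Seq("t2_country", "t2_state", "t2_town_name"))

    val cities = table3.select(
        col("CountryCode").as("t3_country"),
        col("StateCode").as("t3_state"),
        col("TownCode").as("t3_town_code"),
        col("CityID").as("t3_city_id"))
      .dropDuplicates(Seq("t3_country", "t3_state", "t3_town_code"))

    main
      // Step 1: look up the TownCode in Table 2.
      .join(broadcast(towns),
        col("physicalcountrycode") === col("t2_country") &&
          col("physicalstateorprovincecode") === col("t2_state") &&
          col("physicalcityname") === col("t2_town_name"),
        "left")
      // Step 2: look up the CityID in Table 3 using that TownCode.
      .join(broadcast(cities),
        col("physicalcountrycode") === col("t3_country") &&
          col("physicalstateorprovincecode") === col("t3_state") &&
          col("t2_town_code") === col("t3_town_code"),
        "left")
      // A missing match at either step leaves t3_city_id null.
      .withColumn("cityid_valid", col("t3_city_id").isNotNull)
      .drop("t2_country", "t2_state", "t2_town_name", "t2_town_code",
        "t3_country", "t3_state", "t3_town_code", "t3_city_id")
  }
}
```

Because both joins are left joins against de-duplicated lookups, the main table keeps its row count and only gains one boolean column. Each further validation could follow the same pattern with its own reference table and result column.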