
Join not working in Spark 1.5.2 with more than 5 data frames

I would like to perform several joins across multiple tables with PySpark, and then partition my tables by date.

My setup is as follows: 10 GB of memory for the driver, and 10 workers with 5 cores and 10 GB of memory each, in yarn-client mode.

Table1 = 700 MB

Table2 = 1 GB

Table3 = 3 GB

Table4 = 12 GB

Table5 = 6 GB

I tried:

sqlContext.sql("""
    SELECT *
    FROM tab1
    LEFT JOIN tab2 ON tab1.numcnt = tab2.numcnt
    LEFT JOIN tab3 ON tab1.numcnt = tab3.numcnt
    LEFT JOIN tab4 ON tab1.numcnt = tab4.numcnt
""")

When I run this query it takes a crazy amount of time.

I also tried the DataFrame methods:

df_join = df_tab1.join(df_tab2, df_tab1.NUMCNT == df_tab2.NUMCNT, 'left_outer')
df_join = df_join.join(df_tab3, df_tab1.NUMCNT == df_tab3.NUMCNT, 'left_outer')

Same problem: 24 hours of processing without a result.

If you can advise me on how to join these properly, thank you in advance.

You need to cache all the tables you are joining before performing the join. I would recommend caching them, running a count on each to force materialization, and then performing the join. I faced a similar issue too, but after performing the join on cached tables the runtime was significantly reduced.
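A minimal PySpark sketch of that approach, reusing the DataFrame names (df_tab1, df_tab2, df_tab3) and the NUMCNT join key from the question; the cache-then-count pattern is this answer's suggestion, the rest is illustrative:

# Cache each input and run an action so the cache is actually materialized
for df in (df_tab1, df_tab2, df_tab3):
    df.cache()   # mark the DataFrame for in-memory storage
    df.count()   # count() triggers the computation that fills the cache

# The joins now read the cached data instead of recomputing each input
df_join = df_tab1.join(df_tab2, df_tab1.NUMCNT == df_tab2.NUMCNT, 'left_outer')
df_join = df_join.join(df_tab3, df_tab1.NUMCNT == df_tab3.NUMCNT, 'left_outer')

If you go the SQL route instead, sqlContext.cacheTable('tab1') (and likewise for the other tables) achieves the same effect for registered tables.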
