
Join not working in Spark 1.5.2 with more than 5 data frames

I would like to perform several joins across multiple tables with PySpark, and then partition my tables by date.

My setup is as follows: 10 GB of memory for the driver, and 10 workers with 5 cores and 10 GB of memory each, in yarn-client mode.

Table1 = 700 MB

Table2 = 1 GB

Table3 = 3 GB

Table4 = 12 GB

Table5 = 6 GB

I tried:

sqlContext.sql("""
    SELECT *
    FROM tab1
    LEFT JOIN tab2 ON tab1.numcnt = tab2.numcnt
    LEFT JOIN tab3 ON tab1.numcnt = tab3.numcnt
    LEFT JOIN tab4 ON tab1.numcnt = tab4.numcnt
""")

When I run this query it takes a crazy amount of time.

I also tried the DataFrame methods:

df_join = df_tab1.join(df_tab2, df_tab1.NUMCNT == df_tab2.NUMCNT, 'left_outer')
df_join = df_join.join(df_tab3, df_tab1.NUMCNT == df_tab3.NUMCNT, 'left_outer')

Same problem: 24 hours of processing without a result.

If you can advise me on how to join these properly, thank you in advance.

You need to cache all the tables you are joining before performing the join. I would recommend caching them, running a count on each to force materialization, and then performing the join. I faced a similar issue too, but after performing the join on cached tables the runtime was significantly reduced.
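A minimal PySpark sketch of that approach, reusing the DataFrame names (df_tab1, df_tab2, df_tab3) and the NUMCNT join key from the question; the cache-then-count pattern is this answer's suggestion, the rest is illustrative:

# Cache each input and run an action so the cache is actually materialized
for df in (df_tab1, df_tab2, df_tab3):
    df.cache()   # mark the DataFrame for in-memory storage
    df.count()   # count() triggers the computation that fills the cache

# The joins now read the cached data instead of recomputing each input
df_join = df_tab1.join(df_tab2, df_tab1.NUMCNT == df_tab2.NUMCNT, 'left_outer')
df_join = df_join.join(df_tab3, df_tab1.NUMCNT == df_tab3.NUMCNT, 'left_outer')

If you go the SQL route instead, sqlContext.cacheTable('tab1') (and likewise for the other tables) achieves the same effect for registered tables.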
