
Filter in Apache Spark is not working as expected

I'm executing the code below:

ordersRDD = sc.textFile("/user/cloudera/sqoop_import/orders")
orderItemsRDD = sc.textFile("/user/cloudera/sqoop_import/order_items")

# Keep only CANCELED orders, keyed by order_id
ordersParsedRDD = ordersRDD.filter(lambda rec: rec.split(",")[3] in "CANCELED").map(lambda rec: (int(rec.split(",")[0]), rec))
# Build (order_id, item_subtotal) pairs from order_items
orderItemsParsedRDD = orderItemsRDD.map(lambda rec: (int(rec.split(",")[1]), float(rec.split(",")[4])))
# Sum the subtotals per order_id
orderItemsAgg = orderItemsParsedRDD.reduceByKey(lambda acc, value: (acc + value))

# Inner join: (order_id, (order_total, order_record))
ordersJoinOrderItems = orderItemsAgg.join(ordersParsedRDD)

# Show up to five canceled orders whose total is at least 1000
for i in ordersJoinOrderItems.filter(lambda rec: rec[1][0] >= 1000).take(5): print(i)

The final results are not displayed for me; the output stops at this last filter condition.

Before the join I was able to display all records; after the join no data is displayed. The log output is shown below:

16/06/01 17:26:05 INFO storage.ShuffleBlockFetcherIterator: Getting 12 non-empty blocks out of 12 blocks
16/06/01 17:26:05 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/06/01 17:26:05 INFO python.PythonRunner: Times: total = 67, boot = -57, init = 61, finish = 63
16/06/01 17:26:05 INFO python.PythonRunner: Times: total = 73, boot = -68, init = 75, finish = 66
16/06/01 17:26:05 INFO executor.Executor: Finished task 11.0 in stage 289.0 (TID 2689). 1461 bytes result sent to driver
16/06/01 17:26:05 INFO scheduler.TaskSetManager: Finished task 11.0 in stage 289.0 (TID 2689) in 153 ms on localhost (12/24)

I would start by printing the count of ordersJoinOrderItems. If it is bigger than 0, that would suggest your final filter is the culprit.
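For example, here is a minimal debugging sketch (assuming the same RDD names as in the question) that counts each intermediate RDD; the first count that comes back as 0 points at the step that dropped the data:

# Count each intermediate RDD to see where the data disappears
print(ordersParsedRDD.count())       # orders that survived the CANCELED filter
print(orderItemsAgg.count())         # distinct order_ids with an aggregated total
print(ordersJoinOrderItems.count())  # pairs surviving the inner join
print(ordersJoinOrderItems.filter(lambda rec: rec[1][0] >= 1000).count())  # totals >= 1000

If the join count is non-zero but the last count is 0, there are simply no CANCELED orders with a total of 1000 or more, and take(5) correctly returns nothing. Note also that the lines you posted are INFO-level shuffle logging, not an error.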
