
Filter in Apache Spark is not working as expected

I'm executing the code below:

# Load the orders and order_items datasets from HDFS
ordersRDD = sc.textFile("/user/cloudera/sqoop_import/orders")
orderItemsRDD = sc.textFile("/user/cloudera/sqoop_import/order_items")

# Keep only CANCELED orders (status is column 3), keyed by order_id (column 0)
ordersParsedRDD = ordersRDD.filter(lambda rec: rec.split(",")[3] in "CANCELED").map(lambda rec: (int(rec.split(",")[0]), rec))
# Key each order item by order_id (column 1), with the subtotal (column 4) as the value
orderItemsParsedRDD = orderItemsRDD.map(lambda rec: (int(rec.split(",")[1]), float(rec.split(",")[4])))
# Sum the subtotals per order
orderItemsAgg = orderItemsParsedRDD.reduceByKey(lambda acc, value: (acc + value))

# Join the aggregated totals with the cancelled orders on order_id
ordersJoinOrderItems = orderItemsAgg.join(ordersParsedRDD)

# Show cancelled orders whose total is at least 1000
for i in ordersJoinOrderItems.filter(lambda rec: rec[1][0] >= 1000).take(5): print(i)

The final results are never displayed; execution stops at this condition.

Before the join I was able to display all of the records, but after the join no data is shown.
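For example, sampling either side before the join prints records as expected (a quick sketch of what I mean; take(5) is just for illustration):

for i in orderItemsAgg.take(5): print(i)        # (order_id, summed subtotal) pairs
for i in ordersParsedRDD.take(5): print(i)      # (order_id, raw CANCELED order record) pairs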
The last log output before it stops is shown below:

16/06/01 17:26:05 INFO storage.ShuffleBlockFetcherIterator: Getting 12 non-empty blocks out of 12 blocks
16/06/01 17:26:05 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/06/01 17:26:05 INFO python.PythonRunner: Times: total = 67, boot = -57, init = 61, finish = 63
16/06/01 17:26:05 INFO python.PythonRunner: Times: total = 73, boot = -68, init = 75, finish = 66
16/06/01 17:26:05 INFO executor.Executor: Finished task 11.0 in stage 289.0 (TID 2689). 1461 bytes result sent to driver
16/06/01 17:26:05 INFO scheduler.TaskSetManager: Finished task 11.0 in stage 289.0 (TID 2689) in 153 ms on localhost (12/24)

I would start by printing the count of ordersJoinOrderItems. If it is greater than 0, that suggests your filter is the culprit.
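A minimal sketch of that check, assuming the RDDs from the question are already defined:

# Count the joined records; a value greater than 0 means the join itself produced data
print(ordersJoinOrderItems.count())

# If so, inspect a few joined records: join() yields (order_id, (aggregated_total, order_record)),
# so rec[1][0] must be the float total for the >= 1000 filter to make sense
for rec in ordersJoinOrderItems.take(5): print(rec)

# Finally, count after the filter to see whether the threshold drops everything
print(ordersJoinOrderItems.filter(lambda rec: rec[1][0] >= 1000).count())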
