
Filter in Apache Spark is not working as expected

I'm executing the code below:

ordersRDD = sc.textFile("/user/cloudera/sqoop_import/orders")
orderItemsRDD = sc.textFile("/user/cloudera/sqoop_import/order_items")

# Keep only CANCELED orders, keyed by order_id
ordersParsedRDD = ordersRDD.filter(lambda rec: rec.split(",")[3] in "CANCELED").map(lambda rec: (int(rec.split(",")[0]), rec))
# Build (order_id, item_subtotal) pairs from order_items
orderItemsParsedRDD = orderItemsRDD.map(lambda rec: (int(rec.split(",")[1]), float(rec.split(",")[4])))
# Sum the subtotals per order_id
orderItemsAgg = orderItemsParsedRDD.reduceByKey(lambda acc, value: (acc + value))

# Inner join: (order_id, (order_total, order_record))
ordersJoinOrderItems = orderItemsAgg.join(ordersParsedRDD)

# Show up to five canceled orders whose total is at least 1000
for i in ordersJoinOrderItems.filter(lambda rec: rec[1][0] >= 1000).take(5): print(i)

The final results are not displayed for me; the output stops at this last filter condition.

Before the join I was able to display all records; after the join no data is displayed. The log output is shown below:

16/06/01 17:26:05 INFO storage.ShuffleBlockFetcherIterator: Getting 12 non-empty blocks out of 12 blocks
16/06/01 17:26:05 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/06/01 17:26:05 INFO python.PythonRunner: Times: total = 67, boot = -57, init = 61, finish = 63
16/06/01 17:26:05 INFO python.PythonRunner: Times: total = 73, boot = -68, init = 75, finish = 66
16/06/01 17:26:05 INFO executor.Executor: Finished task 11.0 in stage 289.0 (TID 2689). 1461 bytes result sent to driver
16/06/01 17:26:05 INFO scheduler.TaskSetManager: Finished task 11.0 in stage 289.0 (TID 2689) in 153 ms on localhost (12/24)

I would start by printing the count of ordersJoinOrderItems. If it is bigger than 0, that would suggest your final filter is the culprit.
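For example, here is a minimal debugging sketch (assuming the same RDD names as in the question) that counts each intermediate RDD; the first count that comes back as 0 points at the step that dropped the data:

# Count each intermediate RDD to see where the data disappears
print(ordersParsedRDD.count())       # orders that survived the CANCELED filter
print(orderItemsAgg.count())         # distinct order_ids with an aggregated total
print(ordersJoinOrderItems.count())  # pairs surviving the inner join
print(ordersJoinOrderItems.filter(lambda rec: rec[1][0] >= 1000).count())  # totals >= 1000

If the join count is non-zero but the last count is 0, there are simply no CANCELED orders with a total of 1000 or more, and take(5) correctly returns nothing. Note also that the lines you posted are INFO-level shuffle logging, not an error.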
