
Spark: Silently execute sc.wholeTextFiles

I am loading about 200k text files in Spark using input = sc.wholeTextFiles("hdfs://path/*"), and then I run println(input.count). My Spark shell outputs a ton of text (the path of every file), and after a while it just hangs without returning my result.
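For reference, here is a minimal sketch of the shell session (assuming the Scala Spark shell, where sc is the pre-built SparkContext; the HDFS path is a placeholder):

// Read every file under the path into an RDD of (filename, content) pairs
val input = sc.wholeTextFiles("hdfs://path/*")
// count triggers the job; println reports how many files were loaded
println(input.count)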

I believe this may be due to the amount of text output by wholeTextFiles. Do you know of any way to run this command silently? Or is there a better workaround?

Thanks!

How large are your files? From the wholeTextFiles API docs:

Small files are preferred, large files are also allowable, but may cause bad performance.

In conf/log4j.properties, you can suppress excessive logging, like this:

# Set everything to be logged to the console
log4j.rootCategory=ERROR, console

That way, you'll get back only the res result in the REPL, just like in the Scala (the language) REPL.
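If you'd rather not edit any files, a minimal alternative sketch (assumes Spark 1.4+, where SparkContext.setLogLevel is available; sc is the shell's built-in context):

// Raise the log threshold for this shell session only (Spark 1.4+)
sc.setLogLevel("ERROR")
// The count now comes back without the wall of per-file INFO lines
println(sc.wholeTextFiles("hdfs://path/*").count)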

Here are all the other logging levels you can play with: the log4j API.
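For finer-grained control than a single root level, a sketch of per-logger overrides in the same conf/log4j.properties (standard log4j 1.x syntax; org.apache.spark is Spark's top-level package):

# Keep the root logger quiet on the console
log4j.rootCategory=ERROR, console
# Optionally let Spark's own warnings back through
log4j.logger.org.apache.spark=WARN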
