
Spark speed performance

I have a program written for a single computer (in Python), and I have implemented the same thing in Spark. The program basically just reads a .json file, takes one field from each record, and saves the result back. On 1 master and 1 slave, my Spark program runs approximately 100 times slower than the single-node plain Python version (which, of course, also reads from a file and writes to a file). So I would like to ask where the problem might be.

My Spark program looks like:

import sys
from pyspark import SparkContext

sc = SparkContext(appName="Json data preprocessor")
distData = sc.textFile(sys.argv[2])                          # input path (local or S3)
json_extractor = JsonExtractor(sys.argv[1])                  # name of the field to extract
cleanedData = distData.flatMap(json_extractor.extract_json)
cleanedData.saveAsTextFile(sys.argv[3])                      # output path

JsonExtractor only selects the data from the field given by sys.argv[1].
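JsonExtractor itself is not shown above; here is a minimal sketch of what such a class might look like, assuming each input line is a standalone JSON object and sys.argv[1] is the field name (the details are hypothetical):

import json

class JsonExtractor(object):
    """Hypothetical helper: pulls one field out of each JSON line."""

    def __init__(self, field):
        self.field = field

    def extract_json(self, line):
        # Return the field's value as a one-element list so it composes with flatMap;
        # return an empty list for lines that cannot be parsed or lack the field.
        try:
            record = json.loads(line)
        except ValueError:
            return []
        if self.field in record:
            return [json.dumps(record[self.field])]
        return []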

My data consists of many small files, each containing a single line, and that line is always JSON.

I have tried both reading and writing the data from/to Amazon S3 and from/to the local disk on all the machines.

I would like to ask if there is something I might be missing, or if Spark is simply supposed to be this slow compared to a local, non-parallel, single-node program.

As I was advised on the Spark mailing list, the problem was the large number of very small JSON files.

Performance can be much improved either by merging the small files into one larger file (see the sketch after the code below) or by coalescing the partitions:

import sys
from pyspark import SparkContext

sc = SparkContext(appName="Json data preprocessor")
distData = sc.textFile(sys.argv[2]).coalesce(10)             # merge the input into 10 partitions
json_extractor = JsonExtractor(sys.argv[1])
cleanedData = distData.flatMap(json_extractor.extract_json)
cleanedData.saveAsTextFile(sys.argv[3])
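
For the merging option, a minimal sketch (assuming the inputs are plain one-line JSON text files in a local directory; the file and directory names are hypothetical):

import glob

# Concatenate many small one-line JSON files into a single larger input file,
# so that Spark does not create one task per tiny file when reading.
with open("merged_input.json", "w") as out:
    for path in sorted(glob.glob("input_dir/*.json")):
        with open(path) as f:
            out.write(f.read().rstrip("\n") + "\n")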
