
Spark and Python use custom file format/generator as input for RDD

I would like to ask about input possibilities in Spark. I can see from http://spark.apache.org/docs/latest/programming-guide.html that I can use sc.textFile() to read text files into an RDD, but I would like to do some preprocessing before the data is distributed into the RDD. For example, my file might be in JSON format, e.g. {id:123, text:"...", value:6}, and I would like to use only certain fields of the JSON for further processing.

My idea was: is it somehow possible to use a Python generator as an input to the SparkContext?

Or is there some more natural way in Spark to process custom files, not just plain text?

EDIT:

It seems that the accepted answer should work, but it has led me to the following, more practical question: Spark and Python trying to parse wikipedia using gensim

The fastest way to do this is probably to load the text file as-is and do your processing to select desired fields on the resulting RDD. This parallelizes that work across the cluster and will scale more efficiently than doing any preprocessing on a single machine.

For JSON (or even XML), I don't think you need a custom input format. Since PySpark executes within a Python environment, you can use the functions readily available to you in Python to deserialize the JSON and extract the fields you want.

For example:

import json

# Each element of the RDD is one line of the file, assumed to hold one JSON object
raw = sc.textFile("/path/to/file.json")
# Deserialize each line into a Python dict
deserialized = raw.map(lambda x: json.loads(x))
# Keep only the field you care about
desired_fields = deserialized.map(lambda x: x['key1'])

desired_fields is now an RDD of all the values under key1 in the original JSON file.
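For instance, you could pull a small sample of those values back to the driver to check the result (a quick sketch, assuming the RDD built above):

# Bring the first three extracted values back to the driver for inspection
desired_fields.take(3)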

You can use this pattern to extract a combination of fields, split them by whitespace, or whatever.

desired_fields = deserialized.map(lambda x: (x['key1'] + x['key2']).split(' '))

And if this gets too complicated, you can replace the lambda with a regular Python function that does all the preprocessing you want and just call deserialized.map(my_preprocessing_func).
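For example, a minimal sketch of that approach, using a hypothetical my_preprocessing_func and the placeholder field names 'key1' and 'key2' from above:

# my_preprocessing_func receives the dict produced by json.loads for one line
def my_preprocessing_func(record):
    # Combine the two fields and split on whitespace
    combined = record['key1'] + ' ' + record['key2']
    return combined.split(' ')

desired_fields = deserialized.map(my_preprocessing_func)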

Yes, you can create an RDD from a Python variable using SparkContext.parallelize():

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
distData.count()   # 5

This variable can also be an iterator.
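For example, a generator works as input too; note, though, that parallelize() materializes the data on the driver before distributing it, so this is not a way to stream arbitrarily large inputs (a quick sketch):

# A generator is accepted as input; its contents are collected on the driver
# and then split into partitions across the cluster.
def squares():
    for i in range(5):
        yield i * i

distData = sc.parallelize(squares())
distData.collect()   # [0, 1, 4, 9, 16]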
