Python Spark implementing map-reduce algorithm to create (column, value) tuples
UPDATE (04/20/17): I am using Apache Spark 2.1.0 and I will be using Python.
I have narrowed down the problem and hopefully someone more knowledgeable about Spark can answer. I need to create an RDD of tuples from the header of the values.csv file:
values.csv (main collected data, very large):
+--------+---+---+---+---+---+----+
| ID | 1 | 2 | 3 | 4 | 9 | 11 |
+--------+---+---+---+---+---+----+
| | | | | | | |
| abc123 | 1 | 2 | 3 | 1 | 0 | 1 |
| | | | | | | |
| aewe23 | 4 | 5 | 6 | 1 | 0 | 2 |
| | | | | | | |
| ad2123 | 7 | 8 | 9 | 1 | 0 | 3 |
+--------+---+---+---+---+---+----+
output (RDD):
+----------+----------+----------+----------+----------+----------+----------+
| abc123 | (1;1) | (2;2) | (3;3) | (4;1) | (9;0) | (11;1) |
| | | | | | | |
| aewe23 | (1;4) | (2;5) | (3;6) | (4;1) | (9;0) | (11;2) |
| | | | | | | |
| ad2123 | (1;7) | (2;8) | (3;9) | (4;1) | (9;0) | (11;3) |
+----------+----------+----------+----------+----------+----------+----------+
What I did was pair each value with the column name of that value, in the format:
(column_number, value)
raw format (if you are interested in working with it):
id,1,2,3,4,9,11
abc123,1,2,3,1,0,1
aewe23,4,5,6,1,0,2
ad2123,7,8,9,1,0,3
The Problem:
The example values.csv file contains only a few columns, but in the actual file there are thousands of columns. I can extract the header and broadcast it to every node in the distributed environment, but I am not sure if that is the most efficient way to solve the problem. Is it possible to achieve the output with a parallelized header?
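For reference, the pairing itself is a purely per-row operation, so once the header has been broadcast it can be expressed as an ordinary function passed to rdd.map. A minimal sketch of that function in plain Python (the pair_row helper and the hard-coded header list are my own illustration, not from the question):

```python
# Sketch: pair each value in a CSV data line with its column name.
# In Spark the header would be parsed from the first line of the file and
# broadcast, e.g. (hypothetical names):
#   header = sc.broadcast(header_cols)
#   pairs_rdd = data_rdd.map(lambda line: pair_row(line, header.value))

def pair_row(line, header):
    """Turn 'abc123,1,2,3,...' into ('abc123', [(column, value), ...])."""
    fields = line.split(',')
    row_id, values = fields[0], fields[1:]
    # zip aligns each value with the header entry at the same position
    return (row_id, list(zip(header, values)))

header = ['1', '2', '3', '4', '9', '11']  # parsed from the CSV header line
print(pair_row('abc123,1,2,3,1,0,1', header))
# ('abc123', [('1', '1'), ('2', '2'), ('3', '3'), ('4', '1'), ('9', '0'), ('11', '1')])
```

Broadcasting a header of a few thousand strings is cheap (kilobytes per executor), so the broadcast approach the question describes is generally reasonable here.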
I think you can achieve the solution using a PySpark DataFrame too. However, my solution is not optimal yet. I use split to get the new column names and the corresponding columns to sum. This depends on how large your key_list is: if it is too large, this might not work well, because you have to load key_list into memory (using collect).
import pandas as pd
import pyspark.sql.functions as func

# example data
values = spark.createDataFrame(pd.DataFrame([['abc123', 1, 2, 3, 1, 0, 1],
                                             ['aewe23', 4, 5, 6, 1, 0, 2],
                                             ['ad2123', 7, 8, 9, 1, 0, 3]],
                                            columns=['id', '1', '2', '3', '4', '9', '11']))
key_list = spark.createDataFrame(pd.DataFrame([['a', '1'],
                                               ['b', '2;4'],
                                               ['c', '3;9;11']],
                                              columns=['key', 'cols']))
# use values = spark.read.csv(path_to_csv, header=True) for your data
key_list_df = key_list.select('key', func.split('cols', ';').alias('col'))
key_list_rdd = key_list_df.rdd.collect()
for row in key_list_rdd:
    # built-in sum adds the Column objects together, yielding one new column per key
    values = values.withColumn(row.key, sum(values[c] for c in row.col if c in values.columns))
keys = [row.key for row in key_list_rdd]
output_df = values.select(keys)
Output

output_df.show(n=3)
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 3| 4|
| 4| 6| 8|
| 7| 9| 12|
+---+---+---+
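To see what the withColumn loop computes, here is the same aggregation written in plain Python over the example rows (the dict literals mirror values.csv and key_list above; this is only a check of the logic, not Spark code):

```python
# For each key in key_list, sum the row's values over the listed columns,
# skipping columns that are absent from the row (like the `if c in` guard above).
rows = {
    'abc123': {'1': 1, '2': 2, '3': 3, '4': 1, '9': 0, '11': 1},
    'aewe23': {'1': 4, '2': 5, '3': 6, '4': 1, '9': 0, '11': 2},
    'ad2123': {'1': 7, '2': 8, '3': 9, '4': 1, '9': 0, '11': 3},
}
key_list = {'a': ['1'], 'b': ['2', '4'], 'c': ['3', '9', '11']}

for row_id, row in rows.items():
    agg = {k: sum(row[c] for c in cols if c in row) for k, cols in key_list.items()}
    print(row_id, agg)
# abc123 {'a': 1, 'b': 3, 'c': 4}
# aewe23 {'a': 4, 'b': 6, 'c': 8}
# ad2123 {'a': 7, 'b': 9, 'c': 12}
```

These match the three rows of output_df above.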