
Python Spark implementing map-reduce algorithm to create (column, value) tuples

UPDATE (04/20/17): I am using Apache Spark 2.1.0 and I will be using Python.

I have narrowed down the problem, and hopefully someone more knowledgeable about Spark can answer. I need to create an RDD of tuples from the header of the values.csv file:

values.csv (main collected data, very large):

+--------+---+---+---+---+---+----+
|   ID   | 1 | 2 | 3 | 4 | 9 | 11 |
+--------+---+---+---+---+---+----+
| abc123 | 1 | 2 | 3 | 1 | 0 | 1  |
| aewe23 | 4 | 5 | 6 | 1 | 0 | 2  |
| ad2123 | 7 | 8 | 9 | 1 | 0 | 3  |
+--------+---+---+---+---+---+----+

output (RDD):

+----------+----------+----------+----------+----------+----------+----------+
| abc123   | (1;1)    | (2;2)    | (3;3)    | (4;1)    | (9;0)    | (11;1)   |
| aewe23   | (1;4)    | (2;5)    | (3;6)    | (4;1)    | (9;0)    | (11;2)   |
| ad2123   | (1;7)    | (2;8)    | (3;9)    | (4;1)    | (9;0)    | (11;3)   |
+----------+----------+----------+----------+----------+----------+----------+

I paired each value with the column name of that value, in the format:

(column_number, value)

raw format (if you are interested in working with it):

id,1,2,3,4,9,11
abc123,1,2,3,1,0,1
aewe23,4,5,6,1,0,2
ad2123,7,8,9,1,0,3

The Problem:

The example values.csv file contains only a few columns, but in the actual file there are thousands of columns. I can extract the header and broadcast it to every node in the distributed environment, but I am not sure if that is the most efficient way to solve the problem. Is it possible to achieve the output with a parallelized header?
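
For reference, here is a minimal sketch of the broadcast-header idea described above, assuming the raw CSV shown earlier is saved locally as values.csv and that spark is an active SparkSession (the names to_pairs and pairs_rdd are purely illustrative):

# Minimal sketch (assumption): broadcast the header once, then pair each value
# with its column name on the executors.
lines = spark.sparkContext.textFile('values.csv')
header = lines.first()                                         # 'id,1,2,3,4,9,11'
columns = spark.sparkContext.broadcast(header.split(',')[1:])  # column names without 'id'

def to_pairs(line):
    fields = line.split(',')
    # pair each value with its column name: (column_number, value)
    return (fields[0], list(zip(columns.value, fields[1:])))

pairs_rdd = (lines
             .filter(lambda line: line != header)  # drop the header row
             .map(to_pairs))
# pairs_rdd.first() -> ('abc123', [('1', '1'), ('2', '2'), ('3', '3'), ('4', '1'), ('9', '0'), ('11', '1')])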

I think you can achieve the solution using a PySpark DataFrame too. However, my solution is not optimal yet. I use split to get the new column names and the corresponding source columns to sum. This depends on how large your key_list is. If it is too large, this might not work well, because you have to load key_list into memory (using collect).

import pandas as pd
import pyspark.sql.functions as func

# example data
values = spark.createDataFrame(pd.DataFrame([['abc123', 1, 2, 3, 1, 0, 1],
                                             ['aewe23', 4, 5, 6, 1, 0, 2],
                                             ['ad2123', 7, 8, 9, 1, 0, 3]], 
                                             columns=['id', '1', '2', '3','4','9','11']))
key_list = spark.createDataFrame(pd.DataFrame([['a', '1'],
                                               ['b','2;4'],
                                               ['c','3;9;11']], 
                                              columns=['key','cols']))
# use values = spark.read.csv(path_to_csv, header=True) for your data

# split the ';'-separated column list into an array of source column names
key_list_df = key_list.select('key', func.split('cols', ';').alias('col'))
# collect the (small) key list to the driver so it can be iterated over
key_list_rdd = key_list_df.rdd.collect()
for row in key_list_rdd:
    # add one column per key holding the sum of its source columns
    values = values.withColumn(row.key, sum(values[c] for c in row.col if c in values.columns))
keys = [row.key for row in key_list_rdd]
output_df = values.select(keys)

Output (for the example key_list: a is column 1, b is the sum of columns 2 and 4, and c is the sum of columns 3, 9, and 11):

output_df.show(n=3)
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  3|  4|
|  4|  6|  8|
|  7|  9| 12|
+---+---+---+
