Python Spark implementing map-reduce algorithm to create (column, value) tuples

Question

UPDATE (04/20/17): I am using Apache Spark 2.1.0 with Python.

I have narrowed down the problem, and hopefully someone more knowledgeable about Spark can answer. I need to create an RDD of tuples that pairs each value in the values.csv file with its column header:

values.csv (main collected data, very large):

+--------+---+---+---+---+---+----+
|   ID   | 1 | 2 | 3 | 4 | 9 | 11 |
+--------+---+---+---+---+---+----+
| abc123 | 1 | 2 | 3 | 1 | 0 | 1  |
| aewe23 | 4 | 5 | 6 | 1 | 0 | 2  |
| ad2123 | 7 | 8 | 9 | 1 | 0 | 3  |
+--------+---+---+---+---+---+----+

output (RDD):

+----------+----------+----------+----------+----------+----------+----------+
| abc123   | (1;1)    | (2;2)    | (3;3)    | (4;1)    | (9;0)    | (11;1)   |
| aewe23   | (1;4)    | (2;5)    | (3;6)    | (4;1)    | (9;0)    | (11;2)   |
| ad2123   | (1;7)    | (2;8)    | (3;9)    | (4;1)    | (9;0)    | (11;3)   |
+----------+----------+----------+----------+----------+----------+----------+

Each value is paired with the name of its column, in the format:

(column_number, value)

raw format (if you are interested in working with it):

id,1,2,3,4,9,11
abc123,1,2,3,1,0,1
aewe23,4,5,6,1,0,2
ad2123,7,8,9,1,0,3

The Problem:

The example values.csv file contains only a few columns, but in the actual file there are thousands of columns. I can extract the header and broadcast it to every node in the distributed environment, but I am not sure if that is the most efficient way to solve the problem. Is it possible to achieve the output with a parallelized header?
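
The broadcast approach described above can be sketched as follows. This is only a minimal sketch under stated assumptions: the helper names (parse_line, pair_with_header, build_pairs_rdd) are hypothetical, every value is assumed to be an integer, and the Spark wiring assumes an existing SparkContext passed in as sc. The pairing itself is plain Python, so it can be checked without a cluster.

```python
import csv


def parse_line(line):
    """Split one CSV line into its fields."""
    return next(csv.reader([line]))


def pair_with_header(columns, line):
    """Return (row_id, [(column_name, value), ...]) for one data line.

    `columns` holds the header names without the leading id column.
    Assumption: every value parses as an integer.
    """
    fields = parse_line(line)
    return fields[0], [(c, int(v)) for c, v in zip(columns, fields[1:])]


def build_pairs_rdd(sc, path):
    """Hypothetical Spark wiring; `sc` is an existing SparkContext."""
    lines = sc.textFile(path)
    header = lines.first()
    # Ship the header names to the executors once, as a broadcast variable.
    columns = sc.broadcast(parse_line(header)[1:])
    return (lines.filter(lambda line: line != header)  # drop the header row
                 .map(lambda line: pair_with_header(columns.value, line)))


# Local check of the pairing logic, no Spark needed.
example = pair_with_header(['1', '2', '3', '4', '9', '11'], 'abc123,1,2,3,1,0,1')
print(example)
```

Even with thousands of columns the header is a single short list of strings, so broadcasting it is probably cheap compared with the data itself; that is an expectation, not something measured here.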

Answer 1

I think you can achieve this with a PySpark DataFrame too. However, my solution is not optimal yet. I use split to get each new column name together with the columns that should be summed into it. Whether this works depends on how large your key_list is. If it is too large, this might not work well, because you have to load key_list into memory (using collect).

import pandas as pd
import pyspark.sql.functions as func

# example data
values = spark.createDataFrame(pd.DataFrame([['abc123', 1, 2, 3, 1, 0, 1],
                                             ['aewe23', 4, 5, 6, 1, 0, 2],
                                             ['ad2123', 7, 8, 9, 1, 0, 3]], 
                                             columns=['id', '1', '2', '3','4','9','11']))
key_list = spark.createDataFrame(pd.DataFrame([['a', '1'],
                                               ['b','2;4'],
                                               ['c','3;9;11']], 
                                              columns=['key','cols']))
# use values = spark.read.csv(path_to_csv, header=True) for your data

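# split the ';'-separated column names into an array column
key_list_df = key_list.select('key', func.split('cols', ';').alias('col'))
# key_list is pulled to the driver here, so it has to fit in memory
key_list_rdd = key_list_df.rdd.collect()
for row in key_list_rdd:
    # Python's built-in sum adds the Column objects; names missing from values are skipped
    values = values.withColumn(row.key, sum(values[c] for c in row.col if c in values.columns))
keys = [row.key for row in key_list_rdd]
output_df = values.select(keys)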

Output

output_df.show(n=3)
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  3|  4|
|  4|  6|  8|
|  7|  9| 12|
+---+---+---+
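
As a cross-check of the table above, the same grouping can be reproduced for a single row in plain Python. This is a sketch only: sum_key_groups is a hypothetical helper, not part of the answer's code, and the dictionaries simply restate the example data for the row abc123.

```python
def sum_key_groups(row, key_to_cols):
    """Sum the listed columns of one row for every key.

    `row` maps column name -> value; `key_to_cols` maps key -> 'c1;c2;...'.
    Column names that are not present in the row are skipped.
    """
    return {key: sum(row[c] for c in cols.split(';') if c in row)
            for key, cols in key_to_cols.items()}


# Row abc123 from the example data, and the key_list from the answer.
row = {'1': 1, '2': 2, '3': 3, '4': 1, '9': 0, '11': 1}
key_to_cols = {'a': '1', 'b': '2;4', 'c': '3;9;11'}

totals = sum_key_groups(row, key_to_cols)
print(totals)  # {'a': 1, 'b': 3, 'c': 4}
```

The result matches the first row of the output: a = 1, b = 2 + 1 = 3, c = 3 + 0 + 1 = 4.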
