Join data with filename in Spark / PySpark

Question

I'm reading in data from a number of S3 files in PySpark. The S3 keys contain the calendar date that the file was created and I'd like to do a join between the data and that date. Is there any way to do a join between the lines of data in files and filenames?

Answer 1

You can add a column to the dataframe that contains the file name. I use this to identify the source of each row after merging the dataframes later:

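from pyspark.sql.functions import lit

# Name of the file that df was loaded from
filename = 'myawesomefile.csv'

# Tag every row of df with its source file as a constant column
df_new = df.withColumn('file_name', lit(filename))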
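
When many files are read in a single call, a constant set with lit cannot tell the rows apart. A minimal sketch of one way around that, assuming the S3 keys embed the date as YYYY-MM-DD (the question does not give the actual key layout): Spark's built-in input_file_name() tags each row with the path it came from, and regexp_extract pulls the date out of that path. The pattern, the helper names and the column names below are all hypothetical.

```python
import re
from datetime import datetime

# Assumption: keys embed the date as YYYY-MM-DD, e.g. logs/2015-11-10/part-0.csv
DATE_PATTERN = r"(\d{4}-\d{2}-\d{2})"


def date_from_key(key):
    """Plain-Python version of the extraction, useful for checking the pattern."""
    match = re.search(DATE_PATTERN, key)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d").date()


def with_file_date(df):
    """Add 'file_name' and 'file_date' columns to a DataFrame read from files."""
    # Imported here so date_from_key can be used without a Spark install.
    from pyspark.sql import functions as F

    # input_file_name() yields the full path of the file each row was read from.
    tagged = df.withColumn("file_name", F.input_file_name())
    # regexp_extract returns an empty string when the pattern does not match.
    return tagged.withColumn(
        "file_date",
        F.to_date(F.regexp_extract("file_name", DATE_PATTERN, 1)),
    )
```

With a dataframe loaded from a glob over the bucket, with_file_date(df) gives every row a file_date column, which can then be used as the key in an ordinary df.join against any table keyed by date.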

Answer 2

Here's what I ended up doing:

I overrode the Hadoop LineRecordReader class so that it includes the filename in each line, then overrode TextInputFormat to use my new LineRecordReader.

Then I loaded the file using the newAPIHadoopFile function.

Links:
LineRecordReader: http://tinyurl.com/linerecordreader
TextInputFormat: http://tinyurl.com/textinputformat
newAPIHadoopFile: http://tinyurl.com/newapihadoopfile
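
The PySpark side of the approach above might look roughly like the sketch below. The answer does not name its custom classes or say how the filename is embedded in each line, so the class name com.example.FileNameTextInputFormat and the tab separator are placeholders, not the author's actual choices. The key and value classes are the ones TextInputFormat normally produces.

```python
# Hypothetical class name; the answer does not give the real one.
CUSTOM_INPUT_FORMAT = "com.example.FileNameTextInputFormat"


def load_lines_with_filename(sc, path, input_format=CUSTOM_INPUT_FORMAT):
    """Read text through a custom InputFormat; returns (offset, line) pairs."""
    return sc.newAPIHadoopFile(
        path,
        inputFormatClass=input_format,
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.Text",
    )


def split_filename(line, sep="\t"):
    """Split 'filename<sep>original line' in two (the separator is an assumption)."""
    filename, _, content = line.partition(sep)
    return filename, content
```

Given a SparkContext sc, load_lines_with_filename(sc, path).values().map(split_filename) would produce (filename, line) pairs ready for a join. The jar containing the custom classes has to be on Spark's classpath, for example via spark-submit --jars.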

2 answers

solution1
1 2015-11-10 18:33:46

solution2
0 ACCEPTED 2015-11-11 22:24:38
