Join data with filename in Spark / PySpark

I'm reading in data from a number of S3 files in PySpark. The S3 keys contain the calendar date on which each file was created, and I'd like to do a join between the data and that date. Is there any way to do a join between the lines of data in files and their filenames?

You can add a column containing the file name to the dataframe; I use this to identify the source of each row after merging the dataframes later:

from pyspark.sql.functions import lit

# Tag every row of df with the name of the file it came from
filename = 'myawesomefile.csv'

df_new = df.withColumn('file_name', lit(filename))
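The original question also wants the calendar date embedded in each S3 key, and the same trick extends to that: extract the date from the key as you read each file, attach it with lit, and the join becomes an ordinary column operation. Below is a minimal sketch, assuming hypothetical keys like s3a://my-bucket/data/2020-01-15.csv (the bucket, path layout, and date pattern are illustrative, not from the original post):

import re
from functools import reduce
from pyspark.sql.functions import lit, to_date

keys = ['s3a://my-bucket/data/2020-01-15.csv',
        's3a://my-bucket/data/2020-01-16.csv']

dfs = []
for key in keys:
    # Pull the date out of the key; the regex assumes YYYY-MM-DD in the name
    date_str = re.search(r'\d{4}-\d{2}-\d{2}', key).group()
    df = spark.read.csv(key, header=True)
    dfs.append(df.withColumn('file_date', to_date(lit(date_str))))

# Merge the per-file dataframes back into one
combined = reduce(lambda a, b: a.unionByName(b), dfs)

Once file_date is a real column, combined.join(other_df, 'file_date') behaves like any other DataFrame join.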

Here's what I ended up doing:

I overrode the LineRecordReader Hadoop class so that it includes the filename in each line, then overrode TextInputFormat to use my new LineRecordReader.

Then I loaded the file using the newAPIHadoopFile function.
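On the PySpark side, that load step looks roughly like the sketch below. This is only a sketch: 'com.example.FileNameTextInputFormat' is a placeholder name for the custom TextInputFormat subclass described above (not a real Hadoop class), the key/value classes assume the custom reader still emits LongWritable offsets and Text lines, and the compiled classes are assumed to be on Spark's classpath (e.g. via the spark.jars configuration):

# Placeholder class name standing in for the custom TextInputFormat
# subclass described above; the path pattern is also illustrative.
rdd = sc.newAPIHadoopFile(
    's3a://my-bucket/data/*',
    inputFormatClass='com.example.FileNameTextInputFormat',
    keyClass='org.apache.hadoop.io.LongWritable',
    valueClass='org.apache.hadoop.io.Text',
)

# Each value should now carry the filename alongside the line content;
# the exact layout depends on how the custom LineRecordReader writes it.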

Links:
LineRecordReader: http://tinyurl.com/linerecordreader
TextInputFormat: http://tinyurl.com/textinputformat
newAPIHadoopFile: http://tinyurl.com/newapihadoopfile
