I want to split a PySpark DataFrame according to the first characters of each line.
The original data has a single column whose rows contain either
a file name (such as 'Date20191009'), or
file contents (such as '1', '2', '3').
Input Sample File(Pyspark DataFrame):
column1
Date20191009
1
2
3
Date20191010
1
4
5
I want a PySpark DataFrame in which the data is split by file name:
the file name goes in column1, and the corresponding file contents go in column2.
Expected Output(Pyspark DataFrame)
column1 column2
Date20191009 [1,2,3]
Date20191010 [1,4,5]
I tried converting to a Pandas DataFrame and using PySpark's DataFrame.collect(), but both failed due to the data volume (more than 9 million rows).
>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import *
>>> w = Window.rowsBetween(Window.unboundedPreceding, 0)
#Input DataFrame
>>> df.show()
+------------+
| column1|
+------------+
|Date20191009|
| 1|
| 2|
| 3|
|Date20191010|
| 1|
| 4|
| 5|
+------------+
#Tag the rows that start with 'Date', then forward-fill the file name with last(..., ignorenulls=True)
>>> df1 = df.withColumn('tmp', when(df.column1.startswith('Date'), df.column1).otherwise(None)).withColumn('temp', last('tmp', True).over(w)).drop('tmp')
>>> df1.show()
+------------+------------+
| column1| temp|
+------------+------------+
|Date20191009|Date20191009|
| 1|Date20191009|
| 2|Date20191009|
| 3|Date20191009|
|Date20191010|Date20191010|
| 1|Date20191010|
| 4|Date20191010|
| 5|Date20191010|
+------------+------------+
#Drop the file-name rows, then group by file name and join the contents
>>> df1.filter(df1.column1 != df1.temp).groupBy(df1.temp).agg(concat_ws(',',collect_list(df1.column1)).alias('column2')).withColumnRenamed("temp", "column1").show()
+------------+-------+
| column1|column2|
+------------+-------+
|Date20191009| 1,2,3|
|Date20191010| 1,4,5|
+------------+-------+